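A Study of the Interaction Effects of Multiple SNPs in the NER Pathway on Lung Cancer Susceptibility
【Author】 Zou Liling (邹莉玲)
【Author information】 Fudan University, Epidemiology and Health Statistics, 2009, PhD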
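【Abstract (translated from the Chinese)】
Background: Lung cancer is among the malignant tumors that affect the most people worldwide. Population epidemiology and basic etiological research have established that lung cancer arises from the joint action of genetic and environmental factors. The body's defense mechanisms play an important protective role during the development of lung cancer. At least 130 DNA repair genes are currently known in humans, and polymorphisms in these genes may alter DNA repair capacity and thereby increase an individual's risk of lung cancer. The nucleotide excision repair (NER) pathway is one of the important DNA repair systems, and the relationship between NER pathway gene polymorphisms and lung cancer susceptibility is an active research topic. However, because of biological factors as well as differences in study design and statistical analysis, association studies of the same question often give inconsistent results. The main reason is that limited sample sizes make it impossible to examine every possible association thoroughly, so each study analyzes the problem from only one angle. Molecular epidemiology is now particularly concerned with gene-gene and gene-environment interactions in disease. Besides the traditional multiple Logistic regression model, methods reported for joint multi-gene analysis include data mining approaches such as multifactor dimensionality reduction (MDR) and classification and regression trees (CART). Each has its own strengths and weaknesses, and both their results and their usefulness remain open to discussion. Association rule analysis is regarded as an effective tool for extracting novel, previously unknown knowledge from large amounts of data, and it can provide useful information about complex associations among attributes and attribute combinations. We therefore used association rules to screen polymorphism data with fairly large sample sizes, in order to supply effective candidate covariates (genes) for the Logistic regression model used in the subsequent analysis.
Objective: To study the relationship between single nucleotide polymorphisms (SNPs) of NER pathway genes and lung cancer susceptibility, to look for interaction effects of multiple SNPs on lung cancer, and to identify suitable screening and analysis methods.
Methods: Based on the characteristics of the NER pathway SNP data, we defined association rule screening criteria and combined them with the Bootstrap technique to select SNP combinations that may carry information about interactions affecting lung cancer, and then fitted a Logistic regression model for confirmatory regression analysis and testing. To give a preliminary check of this approach, we ran a small-scale random simulation: we set the parameters of a simulation model, generated simulated data with properties broadly similar to the real data, analyzed the simulated data with the screening and analysis strategy above, and compared the results with the parameters set in the simulation model. This verified the feasibility of the approach and the applicability of a related method (MDR). The SNP data were simulated in MATLAB 7.0 from a specified biological background and a random simulation model with preset parameters; the corresponding case-control disease status was simulated from the simulation model in SAS 9.13. Association rules were mined with the classical Apriori algorithm. Lift, the P value of Fisher's exact test, support and confidence were chosen as objective measures of the rules. Screening criteria were defined from different thresholds on these measures, each criterion was evaluated on the simulated data, and the most effective one was selected for the analysis of the real data. The association rule analysis was programmed in SAS 9.13. Evaluation measures: (1) for the screening criteria, the mean frequency (MF) with which the rule sets from 100 simulated samples contained the variables and interaction terms preset in the simulation model, its standard error (SE), the 95% confidence interval (CI) of MF, and the total number of rules selected; (2) for the model, the bias (Bias) and degree of bias (DB) of the Logistic regression parameter estimates, and the coverage (Coverage) of the 95% confidence intervals.
Results: The preliminary simulation showed that association rule analysis can indeed uncover possible latent associations among variables in large data sets, including interactions between variables. Screening criteria based on Lift and the Fisher's exact P value, combined with Bootstrap sampling, effectively selected rules containing the variables preset in the simulation model. To ensure that rule mining succeeds, the minimum support (min_sup) and minimum confidence (min_conf) parameters should be set fairly low. Bootstrap sampling made the association rule results more stable and reliable. The MDR results on the simulated data suggest that MDR does not identify interactions in the true sense, and that misuse of the method should be avoided. In the real data we found two polymorphic sites associated with lung cancer susceptibility, XPG-rs732321 and DDB2-rs830083, and two interaction terms, ERCC1-rs3212930×ERCC1-rs3212951 and ERCC2-rs13181×XPG-rs873601. The variant genotypes of XPG-rs732321 (CC+AC) were protective against lung cancer (OR = 0.54, 95% CI = 0.35-0.85). The variant genotypes of DDB2-rs830083 (GG+CG) were risk genotypes for lung cancer (OR = 1.32, 95% CI = 1.03-1.70). The ERCC1 sites rs3212930 and rs3212951 had a synergistic effect on lung cancer (OR = 2.75, 95% CI = 1.18-6.64): individuals carrying variant genotypes at both sites had a higher risk of lung cancer than individuals with a variant at only one site. ERCC2-rs13181 and XPG-rs873601 also interacted (OR = 2.43, 95% CI = 1.09-5.44): individuals with variants at both sites had a higher risk of lung cancer than individuals with a variant at only one site and than individuals with no variant at either site.
Conclusion: Screening criteria built on the objective association rule measures support, confidence, Lift and the Fisher's exact P value, combined with Bootstrap sampling, can effectively select valuable association rules and reveal latent associations among variables, including interactions. When the selected SNPs and SNP combinations are used as candidate covariates for a multivariable Logistic regression model of disease, and the power is sufficiently large, disease-related SNPs and SNP-SNP interactions can be found effectively. Applying this approach to real NER pathway SNP data identified two SNP sites and two interaction terms associated with lung cancer susceptibility. All of these positive findings have a plausible biological and genetic explanation.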
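The four rule measures named in the abstract (support, confidence, Lift and the Fisher's exact P value) can be illustrated for a single rule of the form "genotype pattern → case". The sketch below is only a minimal illustration in Python, not the thesis's SAS implementation; the function name, its arguments and the choice of a one-sided test are assumptions made for this example.

```python
from math import comb

def rule_metrics(n_both, n_antecedent, n_cases, n_total):
    """Measures for the rule 'antecedent -> case' (hypothetical helper).

    n_both       subjects who match the genotype pattern and are cases
    n_antecedent subjects who match the genotype pattern
    n_cases      cases in the whole sample
    n_total      all subjects
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_cases / n_total)
    # One-sided Fisher exact P (an assumption; the abstract does not say
    # which tail was used): probability of at least n_both cases among the
    # n_antecedent pattern carriers under the hypergeometric null.
    upper = min(n_antecedent, n_cases)
    tail = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_antecedent - k)
        for k in range(n_both, upper + 1)
    )
    p_value = tail / comb(n_total, n_antecedent)
    return support, confidence, lift, p_value

# Illustrative counts only, not data from the thesis.
support, confidence, lift, p_value = rule_metrics(30, 40, 100, 200)
```

A screening criterion in the sense of the abstract would then keep a rule only if these values pass chosen thresholds, for example a Lift above 1 together with a small P value.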
【Abstract】
Background: Lung cancer is one of the most serious malignant tumors, with high incidence and mortality rates. Epidemiological and population studies have confirmed that both genetic and environmental factors are involved in the etiology of lung cancer. The body's defense mechanisms play an important protective role during the development of lung cancer. The human body is known to have at least 130 DNA repair genes. Polymorphisms in these genes may change DNA repair capacity and thereby increase an individual's risk of lung cancer. The NER repair pathway is one of the important DNA repair pathways, and the relationship between NER pathway gene polymorphisms and susceptibility to lung cancer has become a very active topic. However, results from similar studies are often inconsistent. The main reason is limited sample size: it is impossible to analyze all possible relationships, so each study examines only part of them. At present, molecular epidemiology is most concerned with gene-gene and gene-environment interactions. Besides the traditional multiple Logistic regression model, methods reported for the analysis of multiple SNPs include multifactor dimensionality reduction (MDR), classification and regression trees (CART) and other data mining methods. All of these methods have their own advantages and limitations, and many questions about their results and effects remain open to discussion. Association rule mining is considered an effective tool for screening novel or unknown knowledge and information from large amounts of data, so it can be used to find valuable information about relationships between attributes in large SNP data sets. This information is useful for selecting candidate covariates (genes) for a subsequent Logistic regression model.
Objective: To study the relationship between NER pathway gene polymorphisms and susceptibility to lung cancer, to find interactions between SNPs related to lung cancer susceptibility, and to identify methods suited to analyzing the relationship between SNPs and disease susceptibility.
Methods: Based on the actual SNP data set, we combined association rule mining with the Bootstrap method to find association rules between SNPs and lung cancer. To confirm the association rule findings, we fitted a Logistic regression model that used the candidate covariates (genes) and interaction information contained in these rules. As a preliminary check of the method, we carried out a small-scale simulation study, using a random simulation model whose parameters were set to reflect a specific biological context similar to that of the SNP data. We analyzed the simulated data with the method above and compared the results with those of other methods. The independent variables were simulated in MATLAB 7.0 according to the simulated biological context, and the dependent variable (disease state) was simulated in SAS according to the simulation model. The classical Apriori algorithm, implemented in SAS 9.13, was used to mine association rules. We selected the following rule interestingness measures: lift, Fisher's exact probability, support and confidence. By varying the thresholds on these measures, we chose the most effective criterion for screening association rules in the analysis of the actual data. Evaluation measures: (1) for the rule selection criteria, the mean frequency (MF) with which the variables and interactions preset in the simulation model appeared in the resulting rules, its standard error (SE) and 95% confidence interval (CI), and the total number of rules; (2) for the model, the bias (Bias) and degree of bias (DB) of the Logistic regression parameter estimates, and the coverage (Coverage) of the 95% confidence intervals.
Results: The small-scale simulation study showed that association rule mining is indeed a useful tool for finding potential associations between variables in large amounts of data, including interactions between variables. Using Fisher's exact probability and lift as rule interestingness measures, combined with Bootstrap sampling, effectively selected rules that included the variables in the simulation model. To ensure that mining succeeds, the minimum support (min_sup) and minimum confidence (min_conf) parameters should be set relatively low. Applying the Bootstrap technique in association rule mining helps to obtain robust results. Both the simulation results and a methodological analysis of MDR indicated that the interactions found by MDR are not credible. The analysis of the actual data showed that the following SNPs and interactions are related to lung cancer susceptibility: XPG-rs732321, DDB2-rs830083, ERCC1-rs3212930×ERCC1-rs3212951 and ERCC2-rs13181×XPG-rs873601. XPG-rs732321 (CC + AC) is a protective genotype for lung cancer (OR = 0.54, 95% CI = 0.35-0.85). DDB2-rs830083 (GG + CG) increases the risk of lung cancer (OR = 1.32, 95% CI = 1.03-1.70). ERCC1-rs3212930 and ERCC1-rs3212951 have a synergistic effect on lung cancer risk (OR = 2.75, 95% CI = 1.18-6.64): individuals carrying variants at both loci have a higher risk of lung cancer than individuals carrying a variant at only one of them. ERCC2-rs13181 and XPG-rs873601 also interact (OR = 2.43, 95% CI = 1.09-5.44): individuals with variants at both sites have a higher risk of lung cancer than those with a variant at only one site or at neither site.
Conclusion: Association rule mining, using the rule measures support, confidence, lift and Fisher's exact probability together with the Bootstrap technique, is useful for finding potential associations between variables in data, including interactions. The SNPs and SNP combinations contained in the rules can be used as candidate covariates (genes) and interaction terms in a multivariable Logistic regression model of disease on SNPs. If the power is large enough, the method can find the SNPs and interactions related to lung cancer. In this research we found two lung cancer susceptibility SNPs and two interactions. All of these positive findings can be reasonably explained from a biological perspective.
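The abstract reports that Bootstrap sampling made the rule results more robust but does not spell out the procedure. One plausible reading, sketched below in Python under that assumption, is to resample subjects with replacement and record how often a rule still passes a lift threshold. The data layout, function names, threshold and number of resamples are all hypothetical choices for illustration, not the thesis's SAS code.

```python
import random

def lift(sample):
    """Lift of the rule 'carrier -> case' in a list of (carrier, case) pairs."""
    n = len(sample)
    carriers = sum(1 for carrier, _ in sample if carrier)
    cases = sum(1 for _, case in sample if case)
    both = sum(1 for carrier, case in sample if carrier and case)
    if carriers == 0 or cases == 0:
        return None  # the rule is undefined in this resample
    return (both / carriers) / (cases / n)

def bootstrap_rule_frequency(data, n_boot=200, min_lift=1.0, seed=0):
    """Fraction of bootstrap resamples in which the rule's lift exceeds min_lift."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        value = lift(resample)
        if value is not None and value > min_lift:
            hits += 1
    return hits / n_boot

# Illustrative data only: 40 carriers (30 cases) and 160 non-carriers (70 cases).
data = ([(True, True)] * 30 + [(True, False)] * 10
        + [(False, True)] * 70 + [(False, False)] * 90)
frequency = bootstrap_rule_frequency(data)
```

A rule that passes the threshold in nearly every resample is less likely to be an artifact of one particular sample, which is the sense in which resampling can stabilize a rule set.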