
Association Rules Mining and Its Applications in Microarray Gene Expression Data

Original title: 基于关联规则的基因芯片数据挖掘与应用

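【Author】 Peng Bin (彭斌)

【Advisor】 Yi Dong (易东)

【Author information】 Third Military Medical University, Epidemiology and Health Statistics, 2008, doctoral dissertation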

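【Chinese abstract, translated】 The completion of the Human Genome Draft (HGD) marked the transition of modern life science research from the genomic era to the post-genomic era. The research focus has shifted from structural genomics to functional genomics, and the interactions and mutual influences among genes are receiving increasing attention from researchers. The gene chip (DNA microarray), a high-throughput detection technology, can measure the expression levels of thousands of genes simultaneously and has become a powerful tool for studying gene-gene interactions. As microarrays generate large volumes of data, data mining has become an important means of extracting gene-related information from microarray expression data. This study addresses the problems that arise when current association rule mining techniques are applied to microarray expression data, and investigates three aspects in depth: mining inter-transaction association rules from time-series microarray expression data, the loss of gene expression status information in traditional association rules, and the clustering of large numbers of association rules. The main content and contributions are as follows:
(1) Mining inter-transaction association rules from time-series microarray expression data. Traditional association rules ignore the temporal information in the data and cannot dynamically predict gene expression states. To solve this problem, this study proposes introducing inter-transaction association rule mining into the analysis of time-series microarray expression data, and describes inter-transaction association rules in detail. The mined rules were analyzed with the help of biological databases, including the Gene Ontology gene annotation database, the iHOP database, and the DAVID bioinformatics resources database. The results show that inter-transaction association rules can effectively mine the hidden information in time-series microarray expression data, that the resulting rules are consistent with the biological background, and that they reasonably describe the dynamic expression behavior among genes. Inter-transaction association rules therefore provide a new means of predicting gene function.
(2) The loss of gene expression status information in traditional association rules. Based on an in-depth analysis of this problem, this study designs a new type of association rule, the Differential Expression Association Rule (DEAR), and gives its basic definition and related concepts. To mine these rules effectively, the dissertation proposes the Differential Expression Association Rules Matrix Algorithm (DEARM algorithm) and describes it in detail. Experimental results show that DEAR outperforms traditional association rules in discovering gene expression patterns and in controlling the generation of redundant rules. As a new type of association rule, DEAR enriches the scope of association rule mining and will help researchers reveal hidden expression relationships among genes from microarray expression data.
(3) Clustering large numbers of association rules. Association rule mining usually derives a large number of rules, which is a major obstacle to later analysis and use. To address this practical problem, this study proposes post-processing association rules with cluster analysis. To cluster the rules more effectively, the dissertation proposes a new similarity measure for association rules, the content-structure weighted measure, which reflects the similarity of rules in terms of both structure and content and overcomes the drawback of existing measures, which consider content only. The clustering results were analyzed together with the Gene Ontology biological database, showing from a biological perspective that the genes involved in the association rules of the same sub-cluster share a similar or related biological basis, which demonstrates the value of clustering in the post-processing of association rules. Cluster analysis therefore offers researchers an important, visual technique for discovering patterns of interest among association rules.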

【Abstract】 The completion of the Human Genome Draft (HGD) marked the entry of modern life science research into the post-genomic era. The research focus has shifted from structural genomics to functional genomics, and strong interest has arisen in elucidating the interactions between genes. The DNA microarray, a high-throughput method, can routinely measure the expression levels of thousands of genes simultaneously, making it a powerful tool for finding relations among genes. Because microarrays produce large volumes of experimental data, data mining has become an important method for extracting useful information from them. To address the problems of association rule mining in microarray gene expression data, this dissertation studies three aspects: the mining of inter-transaction association rules from time-series microarray data, the absence of gene expression status information in traditional association rules, and the clustering of association rules. The main contributions of this dissertation are summarized as follows:
(1) Mining inter-transaction association rules from time-series microarray data. Because traditional association rules ignore the temporal information in time-series microarray data, they reflect only the relations among genes at the same time point and fail to capture dynamic relations. We therefore propose mining inter-transaction association rules from such data, and introduce inter-transaction association rules in detail. Several biological databases, namely Gene Ontology (GO), iHOP (Information Hyperlinked over Proteins), and DAVID (the Database for Annotation, Visualization and Integrated Discovery), were used to help interpret the mined rules. The results show that the rules efficiently extract hidden information from time-series microarray data, and that the rules describing gene behavior over time are consistent with the biological background. Inter-transaction association rules can therefore serve as a new approach to predicting gene function.
(2) The absence of gene expression status information in traditional association rules. After analyzing this problem in depth, we propose a new type of association rule, the differential expression association rule (DEAR), and introduce its definition and related concepts. To mine DEAR efficiently, we propose the differential expression association rules matrix algorithm (DEARM algorithm) and describe it in detail. Experimental results indicate that DEAR performs better than traditional association rules at extracting gene expression patterns and at controlling redundant rules. As a new type of association rule, DEAR enriches association rule mining and will help researchers reveal hidden interactions among genes from microarray data.
(3) The clustering of association rules. A large number of association rules are usually discovered from microarray data, which makes them difficult to analyze and use. To tackle this problem, we propose clustering the association rules. To cluster them effectively, this dissertation proposes a new similarity metric that measures the similarity between two rules in terms of both structure and content, overcoming the drawback of existing similarity metrics, which focus only on content. By analyzing the sub-clusters of association rules together with the Gene Ontology (GO) annotation database, we found that the genes that make up the association rules in the same sub-cluster have a similar or related biological background, which indicates the value of clustering for association rules. Clustering is therefore an important visual technique for finding interesting hidden patterns in association rule mining.
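
The idea of an inter-transaction association rule can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: it assumes expression values have already been discretized into labels such as `g1_up`, treats each time point as one transaction, and restricts rules to a single item on each side. The function name `mine_inter_transaction_rules`, its parameters, and the toy data are all hypothetical. A rule "A at time t implies B at time t + offset" is scored by support (the fraction of window start positions where both hold) and confidence (the fraction of A occurrences that are followed by B).

```python
from itertools import product


def mine_inter_transaction_rules(transactions, max_span=1,
                                 min_support=0.5, min_confidence=0.8):
    """Find rules 'a at time t -> b at time t + offset' (illustrative only).

    transactions: list of sets of discretized items, ordered by time.
    Returns a list of (a, b, offset, support, confidence) tuples.
    """
    # Only start positions with a full window of max_span steps are counted.
    n_windows = len(transactions) - max_span
    if n_windows <= 0:
        return []

    items = sorted(set().union(*transactions))
    rules = []
    for a, b, offset in product(items, items, range(1, max_span + 1)):
        # Start positions where the antecedent occurs.
        starts = [t for t in range(n_windows) if a in transactions[t]]
        if not starts:
            continue
        # Of those, positions where the consequent follows after `offset` steps.
        hits = sum(1 for t in starts if b in transactions[t + offset])
        support = hits / n_windows
        confidence = hits / len(starts)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, offset, support, confidence))
    return rules


# Toy time series (made up for illustration): one set of labels per time point.
series = [
    {"g1_up"},
    {"g2_up"},
    {"g1_up", "g3_down"},
    {"g2_up"},
    {"g1_up"},
    {"g2_up", "g3_down"},
]

rules = mine_inter_transaction_rules(series, max_span=1)
for a, b, offset, support, confidence in rules:
    print(f"{a}(t) -> {b}(t+{offset})  "
          f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy series the only rule that passes both thresholds is `g1_up(t) -> g2_up(t+1)`, which a traditional single-transaction rule could not express because the two items never occur at the same time point.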
