面向生物数据的关联规则挖掘算法及其应用研究

Research on Mining Algorithm of Association Rule and Its Application for Biological Data

【Author】 Ma Meng

【Advisor】 Wang Xufa

【Author information】 University of Science and Technology of China, Computer Application Technology, 2008, PhD

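【Abstract (translated from the Chinese)】 With progress in genomics and proteomics research and the rapid development of modern biotechnology, high-throughput techniques have produced massive amounts of biological data, which provide a data foundation for uncovering the mysteries of life. Biological data are rich in variety, high-throughput, high-dimensional, heterogeneous, and volatile, and they far exceed the capacity of traditional analysis methods. The analysis of biological data has become the bottleneck of present-day biological research, and the need to process, mine, analyze, and understand such data is increasingly urgent. Current biological data analysis has several problems: for example, the algorithmic models used are becoming more and more complex, and the results obtained from black-box algorithms are hard to interpret biologically. Yet the fundamental goal of bioinformatics research is to use biological data to explain the phenomena of life and discover its laws. Association rule mining is an important data mining technique. The patterns it extracts from biological data are both biologically meaningful (important) and mathematically significant (discoverable), and they are structurally transparent and readily interpretable. This dissertation studies association rule mining algorithms for biological data and their applications. The main research content includes:

(1) Multi-association rule mining algorithms and their application. Biological data carry rich information, and with traditional association rule mining alone some meaningful patterns are lost. This dissertation therefore proposes a new form of association rule, the multi-association rule. Building on a formal definition of multi-association rules, it studies criteria for mining useful ones and gives a mining algorithm. The multi-association rules are used to analyze protein structure data, yielding many useful rules, and experiments on two other datasets yield some novel knowledge.

(2) Analysis of protein structure data with quantitative association rules. In 1961 Anfinsen proposed that the primary sequence of a protein molecule completely determines its spatial structure. To examine this hypothesis, several questions need to be analyzed: Do different amino acids have different propensities for forming different protein spatial structures? Is the amino acid sequence of a protein random? Do amino acid co-occurrence patterns exist in the sequence? Do these patterns have different propensities for forming different spatial structures? Most current research predicts the spatial structure at each site of a protein from its amino acid sequence and is mainly qualitative; few studies use quantitative methods to analyze the propensity of different amino acids for forming different protein structures. This dissertation proposes using quantitative association rules to analyze the association between the amino acid composition of a protein and the formation of its structure. Many useful rules are obtained, and they offer reference value for the artificial synthesis of protein molecules.

(3) Application of clustering and association rule mining to gene expression data analysis. Because gene expression data are high-dimensional with few samples, mining association rules from them directly is practically infeasible. This dissertation therefore combines clustering with association rule mining. Gene expression data are first clustered into several gene clusters, which reduces the dimensionality of the data to be analyzed. The expression data in each cluster are then discretized, with each gene discretized into 7 items, and association rule mining is applied. A large number of association rules are obtained, and they provide not only the direction of regulation between genes but also information about its strength.

(4) Mining classification rules from tumor gene expression data. Classification based on association rules is an active topic in association rule mining research, and a large amount of work has been done on it. Because tumor gene expression data are high-dimensional with few samples, it is hard to build a classifier by applying traditional association rule mining algorithms directly. This dissertation therefore proposes a method that mines classification rules directly from tumor gene expression data. The method first extracts classification features from the data, then generates classification rules from those features, and predicts the class of a sample by the rule with the highest confidence. Experiments show that the method has good predictive accuracy and, compared with black-box algorithms, good interpretability.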
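The highest-confidence prediction principle mentioned in item (4) can be illustrated with a minimal Python sketch. It is not the dissertation's method: the function names `mine_class_rules` and `predict`, the `min_support` and `max_len` parameters, the item labels such as `g1=high`, and the brute-force enumeration of itemsets are all assumptions made for illustration, and the feature extraction step described in the abstract is omitted.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(samples, labels, min_support=2, max_len=2):
    """Return {(itemset, label): confidence} for itemsets of up to max_len items.

    samples: list of sets of discretized items (hypothetical labels like "g1=high").
    labels:  class label of each sample.
    A rule "itemset -> label" is kept if at least min_support samples
    contain the itemset and carry that label.
    """
    itemset_counts = Counter()  # how many samples contain each itemset
    joint_counts = Counter()    # how many samples contain it AND have the label
    for items, label in zip(samples, labels):
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_counts[combo] += 1
                joint_counts[(combo, label)] += 1
    return {
        (combo, label): joint / itemset_counts[combo]
        for (combo, label), joint in joint_counts.items()
        if joint >= min_support
    }


def predict(rules, items, default=None):
    """Predict the label of the matching rule with the highest confidence.

    Ties are broken in favor of the longer (more specific) itemset.
    """
    matching = [
        (conf, len(combo), label)
        for (combo, label), conf in rules.items()
        if set(combo) <= set(items)
    ]
    if not matching:
        return default
    return max(matching)[2]


if __name__ == "__main__":
    samples = [
        {"g1=high", "g2=low"},
        {"g1=high", "g2=high"},
        {"g1=low", "g2=low"},
        {"g1=low", "g2=high"},
    ]
    labels = ["tumor", "tumor", "normal", "normal"]
    rules = mine_class_rules(samples, labels)
    print(predict(rules, {"g1=high", "g2=low"}))
```

Because every prediction traces back to one explicit rule and its confidence, a sketch like this also shows why rule-based classifiers are easier to interpret than black-box models.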

【Abstract】 With the rapid progress of genomics and proteomics research and the invention of more advanced biological technologies, a huge amount of biological data has accumulated, providing the data basis for uncovering the nature of life. Biological data have distinctive features: many categories, high throughput, and high dimensionality. These features make the data very difficult to analyze, because they lie far beyond the capacity of traditional statistical analysis methods. Analyzing biological data has become the bottleneck of biological research, and the need to process, mine, analyze, and understand such data is increasingly urgent.

Current research on biological data analysis has some problems. For example, there is a trend toward adopting more and more complicated algorithms and models, and the results produced by black-box algorithms are hard to interpret biologically. Because the aim of bioinformatics research is to interpret biological phenomena and uncover the nature of life from biological data, more appropriate analysis algorithms are needed.

Association rule mining is an important data mining technology. With it, patterns that are significant both biologically and mathematically can be found in biological data. This dissertation studies the theory and application of association rule algorithms for analyzing biological data. Its main contents are described below.

(1) Study of an algorithm for mining multi-association rules and its application. Biological data contain rich information, much of which cannot be mined with traditional association rule algorithms. To mine more knowledge from biological data, this dissertation presents a new form of association rule, the multi-association rule. It gives the formal definition of the multi-association rule, guidelines for mining useful multi-association rules, and an algorithm for mining them. Applying this algorithm to three datasets yielded many useful rules.

(2) Study of analyzing protein structure data using quantitative association rules. In 1961, Anfinsen proposed that the amino acid sequence of a protein molecule completely determines its spatial structure. To examine this assumption, we divide it into the following questions: Is the amino acid sequence of a protein random? Do different types of amino acids have different propensities for forming different protein spatial structures? Do co-occurrence patterns exist in amino acid sequences? Do these patterns have different propensities for forming different protein spatial structures? Most current research focuses on predicting the spatial structure at each site of a protein from its amino acid sequence, which is qualitative analysis. Little research uses quantitative methods to analyze the propensity of each type of amino acid for forming different protein spatial structures. This dissertation uses quantitative association rules to analyze the association between the amino acid composition of a protein and its spatial structure. The experiments yielded many interesting association rules. These rules may give clues about the global interactions among particular sets of amino acids occurring in proteins and about the guiding information that amino acid sequences contain for the formation of protein structure. They should prove valuable for the design and synthesis of artificial peptides outside the cell.

(3) Study of applying clustering and association rule mining to gene expression data analysis. Because gene expression data have high dimensionality and small sample sizes, it is practically infeasible to apply association rule mining algorithms to them directly. Accordingly, this dissertation combines clustering with association rule mining to analyze gene expression data. First, a clustering method is used to obtain gene clusters; then each gene is discretized into seven items; finally, association rule mining is applied to each gene cluster to obtain many rules. These rules give information not only about the direction of gene regulation but also about its strength.

(4) Study of mining classification rules from tumor gene expression data. Classification based on association rules is a useful predictive technology. Because gene expression data have high dimensionality but few samples, it is hard to construct a classifier from such data with traditional association rule mining methods. Hence, this dissertation provides a new method that mines classification rules directly from gene expression data and constructs a classifier from these rules. The experimental results show that this method has high predictive accuracy and is easy to interpret biologically.
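The discretize-then-mine step described in item (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not say how each gene is discretized into seven items, so equal-width binning over each gene's observed range is assumed here; the function names `discretize` and `mine_pair_rules`, the thresholds `min_support` and `min_conf`, and the restriction to rules between pairs of genes are likewise hypothetical. The preceding clustering step is omitted, and the input is taken to be the genes of one cluster.

```python
from collections import Counter
from itertools import permutations


def discretize(values, n_levels=7):
    """Map each expression value to a level 0..n_levels-1 (equal-width bins, assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene carries no level information; put everything in level 0.
        return [0] * len(values)
    return [min(int((v - lo) * n_levels / (hi - lo)), n_levels - 1) for v in values]


def mine_pair_rules(expr, min_support=0.3, min_conf=0.8):
    """Mine rules "(gene_a, level) -> (gene_b, level)" within one gene cluster.

    expr: {gene_name: [expression value per sample]}, all lists of equal length.
    Returns a list of (antecedent, consequent, support, confidence) tuples.
    """
    levels = {gene: discretize(vals) for gene, vals in expr.items()}
    n_samples = len(next(iter(levels.values())))
    rules = []
    for a, b in permutations(levels, 2):
        a_counts = Counter(levels[a])
        pair_counts = Counter(zip(levels[a], levels[b]))
        for (level_a, level_b), count in pair_counts.items():
            support = count / n_samples
            confidence = count / a_counts[level_a]
            if support >= min_support and confidence >= min_conf:
                rules.append(((a, level_a), (b, level_b), support, confidence))
    return rules


if __name__ == "__main__":
    cluster = {
        "geneA": [0.0, 0.0, 0.0, 6.0, 6.0, 6.0],
        "geneB": [6.0, 6.0, 6.0, 0.0, 0.0, 0.0],
    }
    for rule in mine_pair_rules(cluster):
        print(rule)
```

In this toy input, a low level of `geneA` always co-occurs with a high level of `geneB`. Keeping the discretized level on both sides of a rule is one way such rules could carry both a direction and a graded strength, though how the dissertation encodes them is not stated in the abstract.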

  • 【Classification code】TP311.13
  • 【Times cited】6
  • 【Downloads】893