Research and Applications of Clustering Algorithm Models (聚类算法模型的研究及应用)

Researches on Mathematical Models for Data Clustering and Applications

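【Author】 Chen Shu (陈树)

【Supervisor】 Xu Baoguo (徐保国)

【Author information】 Jiangnan University, Light Industry Information Technology and Engineering, 2007, PhD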

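【Abstract, translated from the Chinese】 Bioinformatics is an interdisciplinary field between computational molecular biology and computer science. In recent years, with the development of data mining, biotechnology has been bringing unprecedented change to humanity. This dissertation studies clustering models and their application to bioinformatics. Its main content, contributions and innovations are as follows.

(1) Dense regions in microarray data. A dense region is a statistically significant set of data patterns. It can be used to identify gene patterns and the associated sample sets, and to remove gene patterns associated with outliers, noise and abnormal patterns. From a study of their properties, dense regions are divided into several categories according to their characteristics, and an algorithm is given for each category. The distribution of dense regions of different sizes is then studied in order to assess their biological significance. The algorithm is tested on two real datasets. The first consists of data patterns from samples of 30 batches of pilot-scale β-mannanase production, with sample concentration running from high to low; the algorithm effectively identifies the data patterns of different concentrations. The second is gene expression data from a yeast dataset; the algorithm likewise effectively identifies genes with similar expression. The algorithm is also compared with four other commonly used clustering algorithms, and the comparison of clustering results on the same synthetic dataset shows that it performs better.

(2) Detecting modules in gene networks. Biological studies of gene and protein interaction networks show that these networks are made up of modules. Identifying modules is a key step in understanding the structure of the whole network. To this end, a node dissimilarity measure is combined with a clustering algorithm to give a module identification method. Further, building on the topological overlap matrix (which has been validated in many biological applications), a general node dissimilarity measure is adopted that generalizes the standard topological overlap matrix and is combined with a bidirectional hierarchical clustering algorithm. It is mainly used to identify network modules and can also be used to measure node connectivity, where it outperforms connectivity measures based on node degree. Based on an analysis of the algorithm's properties, the reasons why it is suited to gene expression networks are given. Finally, applications show that the standard topological overlap matrix is suited to finding smaller modules, while the generalized topological overlap matrix combined with bidirectional hierarchical clustering is better suited to finding larger modules.

(3) High-dimensional data clustering based on random projection ensembles. To address how to generate multiple low-dimensional base clusterings and how to combine them, random projection and bipartite graph partitioning are used; in particular, a new OPTOC-based clustering algorithm is used within the ensemble of base clusterings. Ensemble constructors are evaluated on eight datasets, and the results show that ensembles generated by random projection outperform those generated by the other two constructors. Four consensus functions are evaluated on two different types of ensemble, and the results show that the two graph-based partitioning methods outperform the other two, with bipartite graph partitioning giving a relatively high rate of improvement over the base clusterings for both ensembles.

(4) Scale-based clustering. The distinguishing feature of the scale-based clustering model is that it lets the user directly and dynamically control the scale of clustering: viewing the dataset at different scales yields the clustering at the corresponding scale, and this scale is intrinsic to the dataset. The model builds its objective function from homogeneity and separation algorithms for clustering. In particular, Renyi entropy is used to represent within-cluster similarity and between-cluster separation, and a scale parameter controls the scale of clustering. The Pearson correlation coefficient is used to measure similarity between objects in the dataset. The algorithm's time complexity is lower than that of typical hierarchical and partitional clustering algorithms. The model gives good results when clustering bioinformatics and image datasets.

Finally, the dissertation is summarized, and topics for further study and the focus of future research are proposed.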
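As a rough illustration of the standard topological overlap matrix mentioned under (2), the sketch below uses the commonly cited definition for an unweighted, undirected network: the overlap of nodes i and j is (number of shared neighbours + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a_ij is the adjacency entry and k_i the node degree. The generalized measure and the bidirectional hierarchical clustering of the dissertation are not reproduced here; the function name `topological_overlap` and the toy network are illustrative assumptions.

```python
import numpy as np

def topological_overlap(adj):
    """Standard topological overlap for a 0/1 symmetric adjacency matrix.

    Hypothetical helper for illustration; not taken from the dissertation.
    """
    a = np.array(adj, dtype=float)          # copy so the input is not modified
    np.fill_diagonal(a, 0.0)                # ignore self-loops
    k = a.sum(axis=1)                       # node degrees
    shared = a @ a                          # shared[i, j] = number of common neighbours
    denom = np.minimum.outer(k, k) + 1.0 - a
    tom = (shared + a) / denom
    np.fill_diagonal(tom, 1.0)              # a node overlaps fully with itself
    return tom

# Toy network (assumed data): nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
toy = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
tom = topological_overlap(toy)
dissimilarity = 1.0 - tom                   # node dissimilarity to feed a clustering step
```

The dissimilarity matrix is the kind of input a hierarchical clustering routine would take when looking for modules; which linkage or cut rule to use is left open here.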

【Abstract】 Bioinformatics is an interdisciplinary field between computational molecular biology and computer science. With the rapid development of data mining, biological technologies are reshaping human society. This dissertation studies mathematical models for data clustering and their applications in bioinformatics. Its main content, contributions and innovations are described below.

(1) Dense regions in microarray data. A dense region is a data subset of statistical significance. It can identify similar subgroups of genes or samples, and it can also remove outliers and abnormal data. After studying the properties of dense regions, we classify them according to those properties and give a corresponding algorithm for each class. We can also assess their biological meaning from their distribution. Two experiments are reported. The first dataset contains data from 30 beta-mannanase samples, with product concentration running from low to high; the algorithm identifies the subsets of samples and the data patterns simultaneously. The second dataset is a microarray of gene expression during the yeast cell cycle, on which the algorithm also works well. A comparison with four other clustering algorithms on two synthetic datasets demonstrates its usefulness.

(2) Detecting modules in gene networks. Genes and their protein products carry out cellular processes in the context of functional modules, so identifying these modules is critical to understanding the structure of a gene network. We combine a dissimilarity measure with a clustering method to give a new way of identifying modules. Furthermore, building on the topological overlap matrix (a method that has been verified in many biological applications), we propose a generalized measure combined with bidirectional hierarchical clustering. It is mainly applied to detecting modules and to measuring the connection between nodes, where it performs better than other node measures. We also justify its applicability. Finally, the standard topological overlap matrix can find small modules, whereas bidirectional hierarchical clustering based on the generalized topological overlap matrix works well for finding large modules in gene networks.

(3) Ensemble methods for high-dimensional clustering based on random projection. We explore how to use ensemble methods to solve high-dimensional data clustering problems. We investigate three approaches to constructing ensembles based on randomized dimension reduction; in particular, we employ a new clustering method based on the OPTOC algorithm. The results demonstrate that random projection is an effective approach for generating cluster ensembles for high-dimensional data and that its efficacy is attributable to its ability to produce diverse base clusterings. We then employ a graph-based approach that transforms the problem of combining clusterings into a bipartite graph partitioning problem. Comparisons of the bipartite approach with three existing approaches show that it achieves the best overall performance.

(4) A scale-dependent model for clustering. We employ a model for clustering a set of high-dimensional data into homogeneous clusters that are well separated from each other. A novel feature of this model is that it allows the user to directly control the scale of the clusters. This is realized by formulating the clustering problem as an optimization problem. We study some properties of homogeneity and separation defined on pairwise similarity measured by the Pearson correlation coefficient; in particular, we use Renyi's entropy as the index of homogeneity and separation. For a given dataset, the algorithm performs better than typical hierarchical and partitional algorithms. Experimental results on synthetic, biological and image data demonstrate the usefulness of the proposed model.

Finally, the dissertation concludes with a summary, and some problems that need further study are put forward.
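The ensemble idea in (3) can be sketched roughly as follows: project the data onto several random low-dimensional subspaces, cluster each projection, and record the result as an instance-by-cluster incidence matrix, which is one way to represent the bipartite graph between instances and clusters. This is a sketch under stated assumptions, not the dissertation's implementation: plain k-means stands in for the OPTOC-based base clusterer, the Gaussian projection matrix is one common choice, the graph partitioning step itself is omitted, and all function names and parameter values are hypothetical.

```python
import numpy as np

def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's k-means; a stand-in for the OPTOC-based base clusterer."""
    centers = x[rng.choice(len(x), size=k, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):          # leave an empty cluster's centre as is
                centers[c] = x[labels == c].mean(axis=0)
    return labels

def random_projection_ensemble(x, k, n_runs, low_dim, seed=0):
    """Cluster n_runs random projections of x; return one label vector per run."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_runs):
        # Gaussian random projection from x.shape[1] down to low_dim dimensions.
        proj = rng.standard_normal((x.shape[1], low_dim)) / np.sqrt(low_dim)
        runs.append(kmeans(x @ proj, k, rng))
    return runs

def bipartite_incidence(runs, k):
    """Rows are instances, columns are the clusters of every run (n_runs * k)."""
    n = len(runs[0])
    b = np.zeros((n, len(runs) * k), dtype=int)
    for r, labels in enumerate(runs):
        b[np.arange(n), r * k + labels] = 1  # edge: instance i belongs to cluster
    return b

# Assumed toy data: 40 points in 50 dimensions.
x = np.random.default_rng(1).standard_normal((40, 50))
runs = random_projection_ensemble(x, k=3, n_runs=5, low_dim=5)
b = bipartite_incidence(runs, k=3)
```

Each row of `b` has exactly one entry per run, so a consensus function can partition the instance and cluster vertices together; the choice of partitioning algorithm is left open here.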

  • 【Online publication contributor】 Jiangnan University
  • 【Online publication year and issue】 2009, Issue 03