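Research on Density-Based Parallel Clustering Algorithms

Original title: 基于密度的并行聚类算法研究 (published English title: Studying on Parallel Clustering Algorithms Based on the Density)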

【Author】 Mao Shaoyang (毛韶阳)

【Advisor】 Li Kenli (李肯立)

【Author information】 Hunan University, Computer Application Technology, 2007, Master's thesis

【Abstract】 With the continuing development of modern biotechnology, and in particular the implementation of the Human Genome Project, researchers keep acquiring large volumes of gene sequence data, and the gene data available on the Internet is growing exponentially. These rich data provide a basis for analyzing and studying the relationship between the composition of genes and their function. At the same time, advances in information technology, especially the high-speed computing power of rapidly developing supercomputers, are leading algorithm researchers to devise new parallel clustering algorithms for high-dimensional, massive gene sequence data. Ample evidence indicates that an accurate and efficient parallel clustering algorithm can have an immeasurable impact on biological computing, and on gene sequence computation in particular. First, this thesis analyzes several typical serial clustering algorithms with respect to the range of data attributes they apply to and their time complexity, argues that gene sequence data should be clustered by density, and proposes a density function calculation suited to gene sequence data together with a matching formula for the neighborhood radius. Second, based on a study of parallel computing models, it designs a density-based parallel clustering algorithm: three parallel passes, each with time complexity O(n^2/P), reduce the time complexity of the parallel clustering step itself to O(n/P). Compared with traditional density-based clustering algorithms, the method adds one pass of computation, accepting that cost in exchange for lower operational overhead on the machines. Finally, the algorithm is validated on a computer cluster. The experimental results show that it clusters high-dimensional, massive gene sequence data well, that the data within each cluster converge strongly, and that it has a clear advantage in running time.
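To make the O(n^2/P) figure concrete, the following is a minimal sketch of one parallel density pass. It is not the thesis's algorithm. The abstract does not give the density function or the neighborhood radius formula, and it does not define n or P, so the sketch assumes that n is the number of sequences, that P is the number of workers, that distance is the Hamming distance between equal-length sequences, and that the radius is a fixed constant. The function names `hamming`, `chunk_density` and `parallel_density` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def chunk_density(seqs, indices, radius):
    """Density of each point in `indices`: count of other points within `radius`."""
    return {
        i: sum(
            1
            for j, other in enumerate(seqs)
            if j != i and hamming(seqs[i], other) <= radius
        )
        for i in indices
    }


def parallel_density(seqs, radius, workers):
    """Assign about n/workers points to each worker.

    Every worker compares its own points against all n points, so the
    work per worker is O(n^2 / workers) distance evaluations.
    """
    n = len(seqs)
    # Round-robin partition of the point indices (an assumed scheme).
    blocks = [range(k, n, workers) for k in range(workers)]
    density = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(lambda b: chunk_density(seqs, b, radius), blocks):
            density.update(part)
    return [density[i] for i in range(n)]


# Toy data, invented for illustration only.
seqs = ["ACGT", "ACGA", "ACGG", "TTTT", "TTTA"]
densities = parallel_density(seqs, radius=1, workers=2)
print(densities)
```

Threads are used here only to show how the work is partitioned; in CPython they will not speed up this CPU-bound loop, and a run on a real cluster would more likely use processes or message passing.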

  • 【Online publication contributor】 Hunan University
  • 【Online publication issue】 2007, Issue 05
  • 【Classification number】 TP301.6
  • 【Times cited】 1
  • 【Downloads】 193