Research on Gene Cluster Hybrid Algorithm Based on Particle-Pair and Extremal Optimization

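【Author】 Xuan Junbo (禤浚波)

【Advisor】 Zhang Chaoying (张超英)

【Author information】 Guangxi Normal University, Computer Software and Theory, 2011, Master's thesis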

【Abstract】 With the completion of the Human Genome Project, life science research has entered the post-genome era, and its focus has shifted to determining the function of each gene in an organism and the interaction and regulatory relationships among genes. As the most basic experimental tool of functional genomics in the post-genome era, a gene chip can monitor the expression of thousands of genes under different experimental conditions in a single experiment, which produces large volumes of gene expression data carrying information about gene activity. How to analyze and process these data so as to extract meaningful biological and medical information has become a focus of attention and research. Clustering is currently one of the main computational techniques for analyzing and processing gene expression data. Clustering groups genes with similar or identical expression patterns into classes, which supports integrated study of gene function, gene regulation, cellular processes, and cell subtypes, and is of practical importance for supplementing the functional annotation of unknown genes and for clinical diagnosis and treatment. Many researchers in China and abroad have therefore proposed clustering algorithms for gene expression data. Particle-Pair Optimization (PPO), a relatively new gene clustering algorithm, has achieved good clustering results on some gene expression data sets, but it still has problems to be solved.

This thesis studies how to further improve the clustering performance of the PPO algorithm. The main work is as follows:

(1) We briefly introduce the relevant background in bioinformatics, then analyze in some detail how gene expression data are obtained, represented, and preprocessed, together with the principles of cluster analysis and the evaluation of clustering results. Finally, we obtain the two groups of gene expression data sets used in the clustering experiments of this thesis.

(2) We briefly analyze the principles of two traditional gene clustering algorithms, K-means and hierarchical clustering, then introduce the principle of the standard particle swarm optimization (PSO) algorithm and analyze the principle, strengths, and weaknesses of particle swarm clustering. Finally, we describe in some detail the principle, clustering procedure, and characteristics of the basic PPO algorithm.

(3) Through a closer study of the basic PPO algorithm, we identify three open problems and propose three corresponding improvements: initializing one particle with the result of a fast K-means clustering; introducing a strategy for sharing best-solution information between the initial particle pairs; and applying different velocity update formulas to particles of different categories according to statistical information about the particle pairs. Together these yield an improved particle-pair algorithm, ImPPO. To validate the improvements and ImPPO, we carry out clustering experiments on three gene expression data sets. The results show that, compared with K-means and the basic PPO algorithm, the proposed improvements and ImPPO achieve better clustering results on some gene expression data sets. They also show once again that different clustering algorithms, or even the same algorithm with different parameters, may produce different clustering results on the same gene expression data set.

(4) After analyzing the principle and characteristics of the basic Extremal Optimization (EO) algorithm, we combine the strengths of PPO and EO and propose a new hybrid gene clustering algorithm, PPO-EO. During the iteration of the elite particle pair, PPO-EO invokes the EO algorithm inside PPO at a fixed interval of iterations. On the one hand, it uses EO's strong local search capability to counter PPO's tendency to fall prematurely into a local optimum in later iterations; on the other hand, it uses PPO's guaranteed global convergence to compensate for EO's lack of a convergence guarantee. Combining the two in this way is intended to improve the precision of the gene clustering result. To evaluate the hybrid algorithm, we carry out clustering experiments on another three gene expression data sets. The results show that PPO-EO achieves more precise clustering results than K-means and PPO on three clustering evaluation indices: the mean square error function, within-cluster compactness (homogeneity), and between-cluster separation.
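The abstract names three evaluation indices (mean square error, homogeneity, and separation) but does not define them. The following is a minimal sketch, assuming one common distance-based formulation: mean square error as the average squared Euclidean distance from each gene to its cluster centroid, homogeneity as the average distance from each gene to its centroid, and separation as the size-weighted average distance between centroids. The function name `cluster_indices` and these exact definitions are assumptions made for illustration and may differ from those used in the thesis.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_indices(data, labels):
    """Return (mse, homogeneity, separation) for a hard clustering.

    Assumed definitions (not taken from the thesis):
      mse         = mean squared distance of each point to its centroid
      homogeneity = mean distance of each point to its centroid (lower is tighter)
      separation  = average centroid-to-centroid distance, weighted by the
                    product of the two cluster sizes (higher is better)
    """
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    centers = {label: centroid(members) for label, members in clusters.items()}

    n = len(data)
    dists = [euclidean(p, centers[l]) for p, l in zip(data, labels)]
    mse = sum(d ** 2 for d in dists) / n
    homogeneity = sum(dists) / n

    keys = sorted(clusters)
    weighted_sum = 0.0
    weight_total = 0.0
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            w = len(clusters[a]) * len(clusters[b])
            weighted_sum += w * euclidean(centers[a], centers[b])
            weight_total += w
    # With a single cluster there are no centroid pairs; report 0.0.
    separation = weighted_sum / weight_total if weight_total else 0.0
    return mse, homogeneity, separation
```

Under these assumed definitions, a clustering with lower mean square error and homogeneity values and a higher separation value would be read as more precise, which is the sense in which the abstract compares PPO-EO with K-means and PPO.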
