Improving the Clustering Algorithm of the Phylogenetic Profiling Method

An Improvement of Cluster on Phylogenetic Profiling Method

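【Author】 Li Donghan (李东晗)

【Advisor】 Ma Zhiqiang (马志强)

【Author information】 Northeast Normal University, Computer Application Science, 2011, Master's thesis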

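【Summary】 With the advent of efficient, automated sequencing technology, the central task of bioinformatics has shifted from sequencing genes to analyzing genes that have already been sequenced, chiefly studying and annotating gene function. Because homology-based methods have inherent shortcomings and limited accuracy, non-homology methods have drawn increasing attention. Non-homology methods classify sequences by their attributes and then predict function from that classification. Among them, phylogenetic profiling is the most widely used. Phylogenetic profiling was proposed by Pellegrini in 1999, and many researchers have since improved it in three respects: selection of the reference genomes, construction of the phylogenetic profiles, and analysis of profile similarity. Building on that work, this thesis first constructs weight-based phylogenetic profiles and then alternates hierarchical clustering with K-means clustering for the similarity analysis. It proposes two improvements to the similarity analysis stage. First, it introduces a new distance for the clustering stage of hierarchical clustering. Second, it extracts more information from the hierarchical clustering to initialize K-means, making fuller use of the hierarchical result so that the K-means result is more accurate. Clustering algorithms currently rely mainly on Euclidean distance; because most samples lie in a Euclidean space, clustering with Euclidean distance performs reasonably well. The distance adopted in this thesis is a non-Euclidean distance. Compared with Euclidean distance, it strengthens the influence of known information on the distance between samples: it considers not only the distance between two samples but also the distance between each sample and the reference samples. With this new distance, samples close to the known reference set are processed first. The weakness of K-means is its sensitivity to initial conditions: the choice of the initial cluster number K and of the initial cluster centers strongly affects the final clustering. Current improvements to K-means therefore concentrate on choosing this initial information. Earlier work combined hierarchical clustering with K-means in order to obtain the initial cluster number K from the hierarchical method. This thesis goes further and extracts additional information from the hierarchical clustering result to supply the initial cluster centers for K-means. Finally, the Escherichia coli K12 genome is used as the test sample to verify these improvements experimentally. The results show that the new algorithm is more accurate than the original one.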

【Abstract】 With the advent of automatic, efficient sequencing techniques, the central task of bioinformatics has shifted to gene analysis and genome annotation. Because of the shortcomings of homology-based methods, non-homology approaches are receiving more and more attention; they classify sequences and analyze their function on the basis of sequence attributes. Phylogenetic profiling is a non-homology annotation method that uses evolutionary information. Since Pellegrini proposed it in 1999, many researchers have improved it in three respects: reference genome selection, phylogenetic profile construction, and profile similarity analysis. Phylogenetic profiles take three forms: discrete, continuous, and weight-based. The weight-based form is developed from the continuous one. It makes genes that perform well in the sample proteins more prominent, while genes that are seldom translated in the sample are weakened by their weights accordingly. In this thesis we use weight-based phylogenetic profiling to pre-process the protein data, and then use hierarchical clustering and K-means clustering together. We make two improvements on earlier work. First, a distance motivated by the bioinformatics background is used in hierarchical clustering. Second, more information is extracted from the hierarchical clustering result and used as the initial parameters of K-means clustering, which makes K-means more efficient. Most distances adopted in clustering algorithms are Euclidean. Because most samples lie in a Euclidean space, the clustering results are good. The distance we adopt in hierarchical clustering is a new one that belongs to a non-Euclidean space. Compared with Euclidean distance, it strengthens the influence of already-known information. It takes into account not only the distance between two samples but also the distance between the samples and the reference group, which lets us deal first with the samples that are similar to the reference group. The shortcoming of K-means is that its initial parameters have a strong impact on the result, and most current improvements focus on the choice of those parameters. Previously, hierarchical clustering and K-means were used together in order to obtain the initial cluster number K from hierarchical clustering. We extract more information from the hierarchical clustering result to provide the initial points for K-means as well. Finally, the Escherichia coli K12 genome is chosen as the experimental sample to verify the improvements. The results show that, compared with the traditional algorithm, the new algorithm is more accurate and more efficient.
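The idea of letting hierarchical clustering supply the starting points for K-means can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names are hypothetical, average linkage and plain Euclidean distance are assumptions standing in for the thesis's own non-Euclidean distance (which the abstract does not specify), and K is passed in directly rather than derived from the hierarchical result. The profile vectors are made-up toy values.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumption, see lead-in


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def agglomerate(points, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge the second into the first.
        _, ia, ib = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's K-means, started from the given centroids instead of random ones."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels, centroids


def hierarchical_seeded_kmeans(points, k):
    """Use the means of the k hierarchical clusters as the K-means starting points."""
    seeds = [mean([points[i] for i in cluster]) for cluster in agglomerate(points, k)]
    return kmeans(points, seeds)


# Toy weighted profiles: rows are genes, columns are reference genomes (made-up values).
profiles = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
labels, centroids = hierarchical_seeded_kmeans(profiles, k=2)
print(labels)
```

Seeding from the hierarchical clusters removes the random initialization step, so repeated runs on the same profiles give the same partition. Swapping `dist` for a reference-aware distance would be the place to plug in a measure like the one the thesis proposes.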

  • 【Classification number】TP311.13
  • 【Downloads】24