基于智能算法的DNA聚类研究及应用

Research and Application of DNA Clustering Algorithm Based on Intelligent Algorithm

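【Author】 Zhang Li (张丽)

【Supervisor】 Li Zhangquan (李章泉)

【Author information】 Shandong Normal University, Management Science and Engineering, 2010, Master's thesis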

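【Abstract (translated from the Chinese)】 With the continued development of modern biotechnology, and in particular the Human Genome Project, large volumes of gene sequence data are being acquired. The functions of only a small fraction of these genes are known; the functions of most remain unknown. Clustering, a data mining technique, is well suited to analyzing gene data at this scale: it partitions the gene sequences into groups. Because sequences in the same cluster have similar functions, genes of known function can be used to infer the function of unknown genes in the same cluster. Cluster analysis is now widely used in bioinformatics research. The key problem in clustering biological sequences is how to characterize the similarity between sequences. The linear representation of sequence data sometimes fails to reflect how similar two sequences are, so some similarity measures break down in certain cases and degrade the quality of the clustering results. A similarity measure designed purely from the sequence itself therefore cannot yield clusters consistent with real biological observations, which hinders the study of DNA sequence evolution. As research on graphical representations of DNA sequences deepened, Randic et al. first proposed using graphical representations to study sequence clustering. This thesis adopts that idea and clusters sequences using mathematical features extracted from their graphical representations. Building on an existing two-dimensional graphical representation based on base symmetry, it proposes an improved representation that saves space and reflects the biological characteristics of DNA sequences more clearly. Under three sets of mapping rules, each DNA sequence is converted into three two-dimensional curves; a feature matrix is extracted from the curves; and matrix invariants are then used to cluster the sequences. Each DNA sequence is thereby transformed into a multidimensional data object, and clustering DNA sequences becomes clustering multidimensional data. Common clustering algorithms for multidimensional data usually require the number of clusters k to be specified in advance. In most cases k cannot be determined beforehand, so the optimal k has to be found. This thesis uses a clustering algorithm based on particle swarm optimization (PSO). Because the PSO-based clustering algorithm cannot determine k on its own, the k-means algorithm is introduced to find the optimal k and to construct a cluster validity function. Experiments show that a validity function that incorporates between-cluster distance gives a sound judgment of the optimal number of clusters, and that weighting the between-cluster distance makes the function better suited to the analysis of real data.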
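
The representation pipeline summarized in the abstract (sequence to three two-dimensional curves, curves to feature matrices, matrices to invariants) can be illustrated with a minimal Python sketch. The abstract does not state the thesis's mapping rules, feature matrix, or invariant, so everything concrete here is an assumption: the three rules are taken to be the common purine/pyrimidine, amino/keto, and weak/strong groupings of the bases, each curve is a cumulative ±1 walk, the feature matrix is the Euclidean distance matrix between curve points, and the invariant is its leading eigenvalue divided by the number of points. The names `GROUPINGS`, `curve`, `distance_matrix`, `leading_eigenvalue`, and `feature_vector` are hypothetical.

```python
import math

# ASSUMPTION: three two-class groupings of the bases stand in for the
# thesis's "three sets of mapping rules", which the abstract does not give.
GROUPINGS = {
    "purine/pyrimidine": {"A": 1, "G": 1, "C": -1, "T": -1},
    "amino/keto": {"A": 1, "C": 1, "G": -1, "T": -1},
    "weak/strong": {"A": 1, "T": 1, "C": -1, "G": -1},
}


def curve(seq, steps):
    """Return the points (i, y_i) of a cumulative walk that starts at (0, 0)."""
    points = [(0, 0)]
    y = 0
    for i, base in enumerate(seq.upper(), start=1):
        y += steps[base]  # raises KeyError for symbols other than A, C, G, T
        points.append((i, y))
    return points


def distance_matrix(points):
    """Euclidean distance between every pair of curve points (assumed feature matrix)."""
    return [[math.dist(p, q) for q in points] for p in points]


def _matvec(matrix, v):
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def leading_eigenvalue(matrix, iterations=200):
    """Estimate the largest eigenvalue of a symmetric non-negative matrix by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the normalized vector.
    return sum(x * y for x, y in zip(v, _matvec(matrix, v)))


def feature_vector(seq):
    """Map one DNA sequence to a 3-dimensional point, one invariant per curve."""
    features = []
    for steps in GROUPINGS.values():
        points = curve(seq, steps)
        features.append(leading_eigenvalue(distance_matrix(points)) / len(points))
    return features
```

Each sequence becomes one point in a three-dimensional feature space, so any clustering algorithm for multidimensional data can then be applied to a list of such points. Dividing by the number of points is only one possible way to reduce the effect of sequence length.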

【Abstract】 With the continuous development of modern biotechnology, especially the implementation of the Human Genome Project, large quantities of gene sequence data have been acquired. The functions of only a small fraction of these genes are known; the functions of most are not. Clustering, a data mining technique, is capable of analyzing gene data on this scale: the gene sequences are clustered into classes, and because sequences in the same class have similar functions, the functions of unknown genes can be inferred from known genes in the same class. Cluster analysis is now widely used in bioinformatics. The key question in clustering biological sequences is how to characterize the similarity between sequences. The linear arrangement of biological sequence data sometimes fails to reflect the degree of similarity, so some similarity measures fail in certain cases and degrade the quality of the clustering results. A similarity measure designed entirely from the sequence itself will therefore not produce clusters that agree with biological observations, which creates difficulties for the study of DNA sequence evolution. As research on the graphical representation of DNA sequences deepened, Randic first proposed using graphical representations to study the clustering of gene sequences. Following this idea, sequences can be clustered using mathematical features extracted from their graphical representations. Building on an existing two-dimensional graphical representation based on base symmetry, this thesis proposes an improved graphical representation of DNA sequences. The improved method saves space and reflects some biological features of DNA sequences more clearly. According to the mapping rules, each DNA sequence is translated into three two-dimensional curves, feature matrices are extracted from the curves, and the DNA sequences are clustered using matrix invariants. A DNA sequence is thus transformed into a multidimensional data object, and the clustering of DNA sequences becomes the clustering of multidimensional data. Common clustering algorithms for multidimensional data usually require the number of clusters k to be given in advance. In most cases, however, k cannot be determined in advance, so the best number of clusters has to be found by optimization. This thesis uses a clustering algorithm based on particle swarm optimization (PSO). Because the PSO-based clustering algorithm cannot determine k, the k-means algorithm is introduced to obtain the best number of clusters and to construct a cluster validity function. Tests show that the validity function is effective at determining the best number of clusters, and that introducing a weight on the between-class distance makes the function better suited to real data analysis.
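
The search for the best number of clusters described above can be sketched as a sweep over candidate values of k, scoring each k-means result with a validity index. This is a minimal sketch under stated assumptions, not the thesis's method: the PSO stage is omitted and plain k-means with deterministic farthest-point seeding is used in its place, and the index (mean within-cluster distance minus `weight` times the smallest distance between centroids, lower is better) is a hypothetical stand-in for the thesis's validity function, whose exact form the abstract does not give. The names `kmeans`, `validity`, and `best_k` are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's k-means on a list of coordinate tuples, with farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def validity(points, labels, centroids, weight=1.0):
    """ASSUMED index: within-cluster scatter minus weighted centroid separation."""
    within = sum(
        math.dist(p, centroids[lab]) for p, lab in zip(points, labels)
    ) / len(points)
    between = min(
        math.dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )
    return within - weight * between


def best_k(points, k_max, weight=1.0):
    """Return the k in 2..k_max with the lowest validity score, plus all scores."""
    scores = {}
    for k in range(2, k_max + 1):
        labels, centroids = kmeans(points, k)
        scores[k] = validity(points, labels, centroids, weight)
    return min(scores, key=scores.get), scores
```

A larger `weight` rewards well-separated clusters more strongly and tends to favor smaller k; a smaller `weight` favors compact clusters and larger k. In a PSO-based variant, the same index could serve as the fitness function while particles encode candidate centroids.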
