模糊聚类新算法及应用研究

Novel Fuzzy Clustering Algorithms and Applications

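【Author】 高翠芳 (Gao Cuifang)

【Advisor】 吴小俊 (Wu Xiaojun)

【Author information】 Jiangnan University, Light Industry Information Technology and Engineering, 2011, doctoral thesis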

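【Summary】 This thesis studies new fuzzy clustering methods and their applications in biology. It takes innovative fuzzy clustering theory as its focus and practical applications in biology as its background, and it addresses the combination of computational intelligence techniques with related biological applications. It is therefore an interdisciplinary topic with both theoretical significance and practical value. The research follows two lines. First, several new fuzzy clustering algorithms are proposed, which enrich the theory and methods of clustering in pattern recognition. Second, for practical problems in biology, new computational intelligence theory for complex biological datasets is studied, using the data analysis and information mining capabilities of computational intelligence to reveal the complex relationships among large volumes of biological data. In this way the theoretical and applied lines are unified within bioinformatics. The main work is as follows.

1. Fuzzy clustering algorithms based on the kernel method are studied: a collaborative fuzzy kernel clustering algorithm and a weighted fuzzy kernel clustering algorithm. A collaborative relationship function is introduced into the objective function of fuzzy kernel clustering, giving a new collaborative fuzzy kernel clustering algorithm. The algorithm uses the kernel method to map the data into a high-dimensional feature space, which enlarges the differences among samples, and it handles the data of several feature subsets with a single objective function. Collaboration of the fuzzy kernel clustering across the different feature subsets makes the cluster centers more clearly separated and improves the clustering result. In addition, because the weighted fuzzy kernel clustering algorithm (WFKCA) easily falls into local optima, an improved algorithm is proposed that brings the idea of the iterative self-organizing data analysis algorithm (ISODATA) into WFKCA and uses the intermediate results of splitting and merging cluster centers to adjust the initial centers. The improved algorithm uses a measure computed in the feature space and increases the adjustment range of the cluster centers, so its clustering performance is more stable.

2. Clustering algorithms based on fuzzy scatter matrices and their applications are studied. First, the performance of the clustering algorithm based on the fuzzy Fisher criterion (FFC) is improved: because the existing algorithm computes the cluster centers inaccurately, a new method with a more reasonable iterative formula for the cluster centers is proposed, and it achieves better clustering performance. Second, because fuzzy Fisher clustering yields an optimal projection vector during clustering, a classifier suited to intelligent prediction in biology is designed. It differs from both supervised and unsupervised clustering: it is an integrated fuzzy Fisher clustering algorithm, and it is used to recognize the signal peptides of secretory proteins. When users have highly reliable training samples of their own, the fuzzy Fisher classifier readily meets their need to train the model. Last, for high-dimensional biological datasets with complex structure, a method that automatically determines the best number of clusters is proposed. The method embodies the idea of compact clusters that are well separated from one another and, combined with a decision criterion based on the second-order difference of the objective function, determines a reasonable number of clusters for complex biological datasets through the self-learning of the clustering algorithm.

3. Existing feature extraction methods for protein sequences extract features from a whole, independent sequence and do not suit heterologous proteins whose local signal peptide sequence has been replaced. The degree of compatibility between the signal peptide and the heterologous protein is therefore defined as the structural fusion degree. The interaction between the spliced signal peptide and the neighboring residues is analyzed mathematically, and a mathematical model relating the signal peptide splicing region to the target protein is proposed. The structural fusion degree features extracted from the model are used to recognize the secretability of heterologous proteins, with satisfactory experimental results.

4. A recently proposed semi-supervised fuzzy clustering algorithm based on pairwise constraints is studied. The study finds that the constraint term and the objective function of the original algorithm differ in order of magnitude, which is the main cause of over-adjustment of the membership values. To address this, an improved algorithm is proposed on the basis of a redefined objective function. A new constraint penalty function is introduced, and a new semi-supervised clustering algorithm is obtained by optimizing the objective function with the constraint penalty. The new constraint term cooperates well with the original objective function and gives better clustering results through appropriate adjustment of the membership values.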
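The kernel fuzzy c-means (KFCM) baseline that item 1 builds on can be illustrated with a minimal sketch. This is not the thesis's collaborative or weighted variant, neither of which is specified in the abstract. It assumes the common Gaussian-kernel formulation with cluster prototypes kept in the input space; the names `gaussian_kernel` and `kfcm`, the optional `init` argument, and the defaults for `m` and `sigma` are illustrative choices.

```python
import numpy as np


def gaussian_kernel(X, V, sigma):
    """Pairwise K(x_k, v_i) = exp(-||x_k - v_i||^2 / sigma^2), shape (n, c)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)


def kfcm(X, c, m=2.0, sigma=1.0, n_iter=100, tol=1e-6, init=None, seed=0):
    """Baseline kernel fuzzy c-means with a Gaussian kernel (assumed variant).

    Returns the membership matrix U (n, c) and the prototypes V (c, d).
    """
    X = np.asarray(X, dtype=float)
    if init is None:
        # Assumption: start from c distinct samples chosen at random.
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)].copy()
    else:
        V = np.asarray(init, dtype=float).copy()

    for _ in range(n_iter):
        K = gaussian_kernel(X, V, sigma)
        # For a Gaussian kernel, ||phi(x) - phi(v)||^2 = 2 * (1 - K(x, v)),
        # so 1 - K serves as the distance (the factor 2 cancels in U).
        dist = np.maximum(1.0 - K, 1e-12)
        inv = dist ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

        # Prototype update: mean of the samples weighted by u^m * K.
        W = (U ** m) * K
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]

        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:
            break
    return U, V
```

Calling `kfcm(X, c=2, sigma=2.0)` on an `(n, d)` array returns soft memberships, and `U.argmax(axis=1)` gives hard labels. Samples far from every prototype receive very small kernel weights, so the result depends on the initial prototypes, which matches the abstract's concern about local optima.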

【Abstract】 This thesis studies novel fuzzy clustering algorithms and their applications in biology, with emphasis on fuzzy clustering theory and on practical problems in applied biotechnology. It is an interdisciplinary topic spanning computational intelligence and applied biotechnology, and it is significant for both theoretical research and practical application. The thesis follows two research lines. On the theoretical side, new fuzzy clustering algorithms are proposed. On the bioinformatics side, new computational intelligence theories are proposed that aim to solve practical problems in applied biotechnology. The two lines, theory and application, are integrated by using computational intelligence to mine the complex information in biological data. The main contributions are summarized as follows.

1. Two fuzzy clustering algorithms based on the kernel method are studied: collaborative kernel fuzzy clustering and weighted fuzzy kernel clustering. A collaborative kernel fuzzy c-means clustering (CKFCM) algorithm is proposed, in which a collaborative relationship function is incorporated into kernel fuzzy c-means clustering (KFCM). CKFCM maps the observed data into a higher-dimensional feature space with a kernel function, which enlarges the differences among samples, and it handles several feature subsets with a single objective function, improving clustering performance through collaboration among the partition matrices of the different feature subsets. As a result, CKFCM yields more separable cluster centers and better clustering performance. An improved weighted fuzzy kernel clustering algorithm (WFKCA) is also proposed to overcome the tendency of WFKCA to become trapped in a local optimum. The idea of the iterative self-organizing data analysis technique algorithm (ISODATA) is introduced into WFKCA, and the initial center vectors are adjusted using the intermediate results of splitting and/or merging cluster centers, which reduces the likelihood of a local optimum. The improved algorithm uses a matching measure computed in the feature space and increases the adjustment range of the cluster centers, so its clustering performance is more stable.

2. Clustering algorithms based on fuzzy scatter matrices are studied. First, to address the inaccurate iterative expression for cluster centers used by previous algorithms, an improved clustering algorithm based on the fuzzy Fisher criterion (FFC) with new center update equations is proposed. Second, an integrated fuzzy Fisher clustering (IFFC) algorithm that combines supervised and unsupervised clustering is developed, and a novel IFFC-based classifier for recognizing secretory proteins is designed. The classifier is suited to intelligent prediction in biology and makes model training convenient for users. Last, an automatic technique for determining a reasonable number of clusters for complex biological datasets is proposed. The computation is carried out by an optimization algorithm that reflects the idea of intra-cluster compactness and inter-cluster separation, and the cluster number is then determined by the maximum of the second-order difference of the objective function.

3. Previous feature extraction methods for protein sequences are designed for whole, independent sequences and are of limited use for heterologous proteins fused in frame with a signal peptide. A structural fusion degree (SFD) is defined to measure the compatibility between target proteins and signal peptides, and the interaction between the fused signal peptide and the adjacent residues of the protein is analyzed mathematically. A mathematical model of the extended signal region and the protein is proposed. SFD features extracted from this model are used to recognize the secretability of heterologous proteins, with satisfactory results.

4. A recently developed semi-supervised fuzzy clustering algorithm with pairwise constraints is studied. In that algorithm, the penalty cost function and the basic objective function differ in order of magnitude, which causes over-adjustment of the membership values. To solve this problem, an improved algorithm is proposed based on a redefined objective function. A new constraint function is added to the basic objective function as a penalty cost, giving a new semi-supervised optimization problem. The new penalty cost function agrees and cooperates well with the basic objective function and produces more accurate clustering results by moderately increasing or decreasing the ambiguous membership values.
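The cluster-number criterion in item 2, the maximum of the second-order difference of the objective function, can be sketched as follows. The abstract does not give the objective or the exact rule, so this assumes the plain discrete second difference J(c-1) - 2J(c) + J(c+1) evaluated over consecutive candidate cluster numbers. The function names and the sample objective values are hypothetical.

```python
def second_order_difference(values):
    """Discrete second difference at each interior position of `values`."""
    return [
        values[i - 1] - 2 * values[i] + values[i + 1]
        for i in range(1, len(values) - 1)
    ]


def pick_cluster_number(candidates, objective_values):
    """Return the candidate whose second-order difference is largest.

    `candidates` are consecutive cluster numbers (for example 2..c_max) and
    `objective_values[i]` is the converged clustering objective obtained with
    `candidates[i]` clusters. The two endpoints cannot be selected because
    the second difference is undefined there.
    """
    if len(candidates) != len(objective_values):
        raise ValueError("candidates and objective_values must align")
    if len(candidates) < 3:
        raise ValueError("need at least three candidate cluster numbers")

    d2 = second_order_difference(objective_values)
    best = max(range(len(d2)), key=d2.__getitem__)
    return candidates[best + 1], d2


# Illustrative values only: the objective drops sharply up to c = 3,
# then flattens, so the second difference peaks at c = 3.
best_c, d2 = pick_cluster_number([2, 3, 4, 5, 6], [100, 40, 35, 32, 30])
```

For a decreasing objective, a large second difference marks the "elbow" where adding another cluster stops paying off. A criterion built on an objective that should be maximized would need the sign convention revisited.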

  • 【Online publication contributor】 Jiangnan University
  • 【Online publication year and issue】 2011, Issue 08