Research on Text Categorization Based on Genetic Algorithm and Fuzzy Clustering (基于遗传算法与模糊聚类的文本分类研究)

【Author】 Yu Shuiying (于水英)

【Advisor】 Ding Huafu (丁华福)

【Author information】 Harbin University of Science and Technology, Computer Application Technology, 2009, Master's thesis

【Abstract】 With the explosive growth of data, information processing has become an indispensable tool for acquiring useful information, and text categorization has become an important research direction. Fuzzy clustering analysis, an unsupervised learning method, is a research hotspot in text categorization, so research on text categorization based on fuzzy clustering has considerable theoretical and practical significance. However, fuzzy clustering algorithms are sensitive to their initial values. This thesis therefore proposes a text categorization algorithm in which a genetic algorithm optimizes fuzzy clustering. The thesis first tests and compares the fuzzy C-means (FCM) clustering algorithm with feature-weighted FCM (WFCM), an improved version of FCM; the results show that WFCM raises clustering accuracy. Because the genetic algorithm is an efficient stochastic global optimization search algorithm, the thesis combines it with FCM to obtain a genetic-algorithm-based feature-weighted FCM clustering algorithm (GWFCM), which makes full use of the local search ability of FCM and the global search ability of the genetic algorithm. Building on existing work on automatically learning the number of clusters, the thesis improves the cluster validity judgment and changes the number of clusters dynamically during the run, in order to improve the validity and precision of the clustering. To address problems arising from the encoding, the thesis introduces the concept of an average gene difference degree: during execution, the crossover and mutation operators compute this value dynamically and use it to limit the generation of individuals with poor fitness, which improves the execution performance of the genetic algorithm. The resulting clustering method improves considerably on classical clustering algorithms, and through a non-linear mapping it can better distinguish, extract, and amplify useful features. Because the proportional selection operator used in genetic algorithms causes premature convergence early in evolution and declining search efficiency late in evolution, the thesis proposes a non-linear ranking selection mechanism. During population evolution, the thesis also applies an elite gene introduction strategy, which keeps genetic evolution stable, prevents invalid solutions from spreading, thereby supports convergence of the algorithm, and improves the efficiency of the search for cluster centers. To verify the efficiency and feasibility of the proposed algorithm, GWFCM is compared with FCM and WFCM in experiments on a large number of texts. The experiments show that, relative to WFCM, GWFCM improves precision, recall, and F1 by 0.030, 0.022, and 0.026 respectively, and that GWFCM performs well compared with the other methods in text categorization and clustering.
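
To make the feature-weighted FCM step in the abstract concrete, the following is a minimal sketch in plain Python. It assumes fixed per-feature weights inside a weighted squared Euclidean distance and the textbook FCM membership and center updates; the abstract does not give the thesis's actual weighting scheme, so the names `wfcm` and `weighted_sq_dist`, the `weights` parameter, and the fuzzifier default `m=2.0` are illustrative assumptions rather than the author's implementation.

```python
import random


def weighted_sq_dist(x, v, w):
    """Squared Euclidean distance with per-feature weights (assumed form)."""
    return sum(wk * (xk - vk) ** 2 for xk, vk, wk in zip(x, v, w))


def wfcm(points, c, weights, m=2.0, iters=100, tol=1e-6, seed=0):
    """Textbook fuzzy C-means with fixed feature weights (hypothetical helper).

    Returns (centers, memberships); memberships[k][i] is the degree to which
    point k belongs to cluster i, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial centers: this is the initial-value sensitivity that the
    # abstract proposes to address with a genetic algorithm.
    centers = [list(p) for p in rng.sample(points, c)]
    memberships = []
    for _ in range(iters):
        # Membership update: u_ik is proportional to D_ik^(-1/(m-1)),
        # where D_ik is the weighted squared distance to center i.
        memberships = []
        for x in points:
            d = [max(weighted_sq_dist(x, v, weights), 1e-12) for v in centers]
            inv = [dist ** (-1.0 / (m - 1.0)) for dist in d]
            total = sum(inv)
            memberships.append([val / total for val in inv])
        # Center update: mean of the points weighted by u_ik^m.
        new_centers = []
        for i in range(c):
            um = [memberships[k][i] ** m for k in range(n)]
            denom = sum(um)
            new_centers.append(
                [sum(um[k] * points[k][j] for k in range(n)) / denom
                 for j in range(dim)]
            )
        shift = max(
            abs(a - b)
            for old, new in zip(centers, new_centers)
            for a, b in zip(old, new)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


if __name__ == "__main__":
    # Toy 2-D data standing in for document feature vectors.
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, memberships = wfcm(data, c=2, weights=[1.0, 1.0])
    print(centers)
```

In this sketch the starting centers are drawn at random from the data, which is where the initial-value sensitivity mentioned in the abstract comes from. A genetic-algorithm variant such as the GWFCM described above would instead search over candidate centers, using iterations like these as the local refinement step.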
