节点文献

基于粗糙集和遗传算法的聚类方法研究

Research of Clustering Algorithm Based on Rough Set and Genetic Algorithm

【作者】 李琳

【导师】 张超英;

【作者基本信息】 广西师范大学 , 计算机软件与理论, 2009, 硕士

【摘要】 数据挖掘技术是机器学习、数据库和统计理论相结合的产物,是从大量的、不完全的、有噪声的、模糊的、随机的实际数据中,提取隐含的、先前未知的并有潜在价值的信息的非平凡过程。在数据挖掘领域中,聚类分析是一项重要的研究课题。与分类不同,聚类的目标是在没有任何先验知识的前提下,根据数据的相似性将数据聚合成不同的簇,使得相同簇中的元素尽可能相似,不同簇中的元素差别尽可能大,因此又被称为非监督分类。聚类分析作为数据挖掘系统中的一个模块,既可以作为一个单独的工具以发现数据库中数据分布的深层信息,也可以作为其他数据挖掘分析算法的一个预处理步骤,因此研究如何提高聚类算法的性能具有重要的意义。遗传算法是基于生物进化的概念设计了一系列过程来达到优化的目的。这些过程包括:基因组合、交叉、变异、自然选择。在这些过程中,通过“优胜劣汰”的原则来淘汰掉解较差的基因,使得解朝着好的方向发展。遗传算法从一组初始可行解出发在只需要目标函数这一信息的条件下实现对可行域的全局高效搜索并以概率1收敛到全局最优解,这种良好的特性使得遗传算法成为组合优化和函数优化的有力工具,并成为计算智能领域的研究热点.粗糙集理论是一种刻画不确定性和不完整性知识的数学工具,由波兰数学家在上世纪八十年代初首先提出的。粗糙集理论善于分析隐藏在数据中的事实而不需要关于数据的任何附加知识。该理论以其独特的优势正赢得越来越多的研究者的关注,并在各个领域获得了广泛的应用。在数据挖掘领域,粗糙集最初主要用于分类,而今有关粗糙集的研究已深入到该领域的各个方面。目前所用的聚类方法大多是基于对数值属性进行处理的,并且对数值进行处理的方法比较多。而聚类算法中针对符号属性的数据处理则比较困难,往往都是使用概念聚类方法,或者将符号属性转化为数值属性的方法。但是前者过于复杂也不成熟,后者对于数据的符号属性选择有局限性。所以目前大部分的聚类算法都面向数值属性,针对符号属性的则比较少。所以本文提出的算法主要是研究符号属性的数据。粗糙集理论适合用于数据之间(精确的或近似的)依赖关系发现、评价某一分类(属性)的重要性、数据相似或差异发现。经典粗糙集模型比较好的解决了符号型数据的机器学习问题,尤其是符号数据的特征选择、属性约简和规则归纳问题。所以说粗糙集特别适合于处理符号属性的数据。在提高聚类算法的性能方面,遗传聚类算法可较好地解决聚类时的优化问题以及满足优化目标的多样性需求。适应度是遗传算法得以进行下去的关键。由于有了适应度,个体之间才存在竞争。遗传算法的目标函数及适应度函数定义具有很大的灵活性,可根据需要进行定义。遗传算法是可调节的、鲁棒的、高效率的随机搜索算法,它具有的并行性、易于和其它模型结合等性质,适用于数据挖掘,但遗传算法较复杂,容易收敛于局部极小值。粗糙集不需要给出数据之外的额外信息,可以简化输入信息的表达空间,算法简单,易于操作,粗糙集处理的对象是类似二维关系表的信息表,也适用于数据挖掘。遗传算法与粗糙集理论具有优势互补的特点,可以将两者结合应用到聚类中。本文将粗糙集思想与遗传算法结合,提出了一种新的聚类方法。聚类算法质量的一个要求就是高类内相似度、低类间相似度,所以在本文中应用类内相似度和类间相似度来定义遗传算法的适应度函数。由于粗糙集的广义近似空间提出了类内不可区分度和类间不可区分度,所以可以将此思想应用到遗传算法中的适应度函数定义中。本文提出了一种新的面向符号属性的聚类算法(RNGACA)。该算法对于每个不同的值,采用自顶向下的分裂式层次聚类策略,利用RAGA算法对数据集进行逐层二分,直到达到预先指定的聚类的个数,然后输出聚类结果。RAGA算法则是将粗糙集思想和自适应遗传算法结合,对数据进行二分。为了验证该算法,做了4部分实验,第一部分是对4组实验数据进行测试,4组数据均是取自UCI机器学习数据库,该部分以聚类准确率为衡量准则,将RAGACA算法同其他3种算法进行比较;第二部分实验测试是根据基于F-measure方法的测试结果来衡量RAGACA算法和其他两种算法;第三部分是分析RAGACA算法中RAGA算法的收敛性,通过比较RAGA算法与使用标准遗传算法和使用普通自适应遗传算法来分析它们的收敛性;第四部分是分析RAGACA算法的时间复杂度和空间复杂度。通过这四部分实验,可以分析出RAGACA对符号属性数据进行聚类的可行性,以及拥有较高的准确率和收敛性,另外时间复杂度和空间复杂度也并不比其他算法差。

【Abstract】 The data mining technique is a combination of machine learning, database and Statistical theory. Data mining can seek interesting or valuable information within large, incomplete, noisy, rough and random databases. Cluster analysis is an important research problem in the domain of data mining. The goal of clustering is to classify data set into such clusters that intra-cluster data are similar and inter-cluster data are dissimilar without any prior known as“unsupervised classification”. Cluster analysis as a module in the system of data mining can be used not only as a separate technique to discover the information about data distribution, but also as the preprocessing of other data mining operations, therefore it is very meaningful to research how to improve the performance of clustering algorithms.Genetic algorithm based on the conception of biological evolution designs a series of process to optimize the solution. These processes include gene combination、crossover、variation、natural selection. In these procedures, eliminate the bad gene through the principle of“Survival of the fittest”and develop the solution to better direction. Genetic algorithm begins with a group of initial feasible solutions, achieves the global effective search of the feasible field with the only information object function and converges to the global optimum value with the probability 1, this kind of nicer characteristic make the genetic algorithm a useful tool for combination and function optimization. The genetic algorithm becomes the research hotspot in the field of computational intelligence.Rough set theory (RST) is a mathematical tool used for dealing with vagueness and uncertainty which is introduced by Pawlak, in the early 1980s.Rough set theory is good at analyzing the facts hidden in the data without any additional knowledge about the data. Due to its particular advantages, rough set theory has been received more and more attentions from researchers and applied in a variety of areas in recent years. In the domain of data mining, rough set was only used for classification at the beginning, but the research about rough set has already expanded to any aspects of data mining today.We are used in most of the clustering method is based on dealing with numerical attributes currently, and more is to deal with numerical methods. And clustering algorithm for symbolic attributes in the data processing is more difficult, often use the concept of clustering method, or property will be transformed into symbols of the methods of numerical attributes. The former is too complicated but not mature, which is the symbol for the data attribute selection is limited. Therefore, most all of the clustering algorithm for numerical attributes, the attributes for the symbols is relatively small. Therefore, the algorithm proposed in this paper is to study the clustering algorithm for the data of symbolic attributes.Rough set theory was suitable for data (precise or approximate) dependence, evaluation of a classification (attributes) the importance of similarity or difference between the data found. Classical rough set model to solve relatively good data symbols of the machine learning problem, in particular, feature selection data symbol, attribute reduction and rule induction problem. So, Rough set is suited to deal with the data which has symbolic attributes.In improving the performance of clustering algorithm, the genetic algorithm can solve the optimization problem at the time of clustering, as well as the diversity of objectives to meet the needs of optimization. Genetic algorithm fitness is the key to continue. Because of the fitness, the competition can be existent. The objective function of genetic algorithm and the definition of fitness function with a great deal of flexibility can be defined.Genetic algorithm is adjustable, robust and efficient random search algorithm, it is the parallelism, easy, and other models such as the nature, applicable to data mining, but more complex genetic algorithm can easily converge to local minimum value. Rough set data do not need to give additional information, can simplify the expression of input space, the algorithm is simple, easy to operate, the target of rough set to deal with two-dimensional relationship is similar to the information in table form, but also to data mining. Genetic algorithm and rough set theory with the characteristics of complementary advantages, a combination of both can be applied to clustering.This article will be thinking of rough set and genetic algorithms, a new method of clustering. The quality of clustering algorithm is a requirement of high-class similarity and low between-class similarity, so in this paper the application of category similarity and between-class similarity of the genetic algorithm to define the fitness function. As a result of generalized rough set approximation space within a category and can not distinguish between types of non-discrimination, so this thinking can be applied to genetic algorithms in the definition of fitness function. In this paper, a new attribute-oriented clustering algorithm symbols (RNGACA). The algorithm for each different value, the use of top-down hierarchical clustering split strategy RAGA algorithm using data sets of the second sub-layer by layer until it reaches pre-specified number of cluster, and then output the results of clustering.There have been four experiments to verify the algorithm. First, four groups of data which were taken from the UCI machine learning database have been used to measure the algorithm. And use the clustering accuracy to test the RAGACA algorithm and other three algorithms. Second, the experimental test is based on F-measure method to measure the other two algorithms and RAGACA algorithm. Third, the experimental test is the analysis of algorithm convergence RAGA algorithm, by comparing the algorithm with the use of RAGA standard genetic algorithm and adaptive genetic algorithm using the general analysis of their convergence. Fourth, this part is to analyze the algorithm’s time complexity and space complexity. Through the four experiments, you can analyze that the feasibility of the RAGACA could cluster of the data with the symbols attributes. And the RAGACA has higher accuracy and convergence, and time complexity and space complexity is not worse than other algorithms.

【关键词】 数据挖掘聚类粗糙集遗传算法自适应
【Key words】 Data miningclusteringrough setgenetic algorithmsadaptive
节点文献中: 

本文链接的文献网络图示:

本文的引文网络