
粒计算分类知识发现算法及其应用

Classification Knowledge Discovery Algorithms Based on Granular Computing and Its Applications

【Author】 Luo Jianhong (罗建宏)

【Advisor】 Chen Dezhao (陈德钊)

【Author Information】 Zhejiang University, Chemical Process Information Engineering, 2010, PhD

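【Abstract, translated from the Chinese】 Humanity is entering a knowledge economy in which the production and application of knowledge are the most important factors. Intelligent information processing, with knowledge discovery at its core, plays an increasingly important role in producing knowledge. Classification follows the basic workflow of general knowledge discovery (data preprocessing, data mining, model evaluation, and knowledge representation) and is an important knowledge discovery task. Because classification is widely applied and matters greatly in chemistry and chemical engineering, research on classification methods can both advance data mining technology and greatly broaden the prospects for knowledge discovery in these fields.
Research and technology for classification knowledge discovery have advanced considerably and new mining methods keep appearing, but several open problems have become increasingly prominent. In chemistry and chemical engineering in particular, collected and accumulated data tend to be multi-factor, nonlinear, noisy, and non-uniformly distributed. Conventional analysis and processing methods are time-consuming and struggle to uncover the knowledge hidden in such data, so the relevant classification knowledge discovery methods need improvement. Deeper research here is important for the development of the discipline and could yield substantial economic value.
Granular computing (GrC) is a new concept and computing paradigm for information processing that covers all theories, methods, techniques, and tools related to granularity. Its basic idea is to imitate a characteristic of human intelligence: when solving a complex problem, choosing a suitable granularity lowers the difficulty and helps find a better solution. The principles of granular computing offer a new route for knowledge discovery research. Most current work on granular computing, however, is theoretical; applied research is scarce, and reports in chemistry and chemical engineering are rarer still. This thesis summarizes four basic principles for applying granular computing to knowledge discovery, uses them to study several challenging problems in classification knowledge discovery, and proposes corresponding strategies and methods for related problems in chemistry and chemical engineering. The main work and results are as follows:
1. Granulation and clustering summarize knowledge, and the classes produced by clustering characterize the class knowledge contained in the data. Cluster analysis is an important basic method in soft science research and an effective tool. The Adaptive Resonance Theory (ART) network ART2 has many advantages for clustering, but it is insensitive to gradually changing input patterns and has limited noise tolerance. This thesis therefore proposes an improved network, ART2 with Enhanced Triplex Matching mechanism (ETM-ART2), which strengthens the internal detection mechanism to improve ART2 performance. In clustering experiments on olive oil samples, ETM-ART2 clusters well and is particularly suited to massive data sets. ETM-ART2 can also construct information granules for classification problems, which aids knowledge discovery and improves classification performance.
2. Constructing granules is one of the basic steps in applying granular computing. Following the principle of granular approximate solution, this thesis proposes constructing information granules with ART networks, which builds suitable granules for the objects under analysis conveniently and quickly. Following the GrC principle of problem simplification, it then proposes a solution scheme for classification knowledge discovery based on information granules. Two algorithms are developed: the Information Granulation based Fuzzy Classification Knowledge Discovery Method (IG-FCKDM), and KFAG-C4.5, which applies Key Feature Analysis based on Granulation (KFAG) and then mines classification rules with C4.5. IG-FCKDM targets imbalanced two-class problems and error-sensitive problems, where a misclassification may cause heavy losses. It constructs information granules with Fuzzy ART and then extracts classification rules through fuzzy processing. Experiments on disease diagnosis show that IG-FCKDM handles such problems well and that its prediction accuracy and credibility are of greater significance to users. KFAG-C4.5 applies to general classification problems and to imbalanced multi-class problems. It constructs information granules with ETM-ART2, performs the granule-based key feature analysis proposed in this thesis, and divides each attribute into a limited number of sub-attributes with strong class-discriminating power. The granules are then described by these sub-attributes as discrete values of 0 or 1, which makes the final rule mining with C4.5 straightforward. Experiments on the two-class glass problem and on an imbalanced multi-class problem show that KFAG-C4.5 classifies well. The knowledge mined by the two algorithms differs in form, but in both cases it is concise, easy to understand, and easy for professionals in various fields to analyze, and both algorithms handle imbalanced data classification well.
3. Ensemble learning can often improve on a single classifier, and selective ensemble learning has gradually become a research focus. Current selective ensemble algorithms based on stochastic optimization mostly take generalization error as the objective and largely ignore the properties of the individual classifiers, especially measures of diversity. These methods have produced some results, but their computational complexity is high and their efficiency low. To address the difficulty of measuring the diversity of individual classifiers, this thesis uses the GrC principle of problem equivalence to transform the selective ensemble problem into a simpler associated space, studies a simple and efficient selection mechanism, and develops the Correctness and Diversity based Selective Ensemble (CDSE) algorithm, which is based on knowledge granules and balances accuracy with diversity. In classification experiments on mechanisms of toxic action, CDSE outperforms the ensemble algorithms Bagging and AdaBoost.M1 as well as a single C4.5 classifier. By selecting individual classifiers well, CDSE offers an effective way to improve the generalization performance and efficiency of ensemble classification.
4. A new adaptive approach is proposed at two levels, the construction of ensemble classifiers and the prediction decision, extending CDSE into the Correctness and Diversity based Adaptive Selective Ensemble (CDASE) learning algorithm and further improving generalization performance. For each class, CDASE adaptively generates a specifically suited ensemble classifier, and these are combined into a group of ensemble classifiers, AE-Group. Because the ensemble classifiers in the group contain one another, the group uses very few computing resources, which reduces storage space and computation time. AE-Group also decides adaptively at prediction time, selecting the most suitable ensemble classifier from the group to classify each test sample. Experiments on a variety of pattern classification problems show that CDASE achieves good ensemble learning results with fewer individual classifiers. Compared with several other algorithms, CDASE generalizes well, is more efficient, and is stable. CDASE overcomes the narrow applicability of a conventional single ensemble learner and offers a new approach to further improving the generalization ability of ensemble learning.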

【Abstract】 As the world enters the era of the knowledge economy, the production and application of knowledge have become the most important factors. Knowledge discovery, the core of intelligent information processing, plays an increasingly important role in knowledge production. Classification is one of the most important knowledge discovery tasks and involves data preprocessing, data mining, model evaluation, and knowledge representation. Classification is widely used and highly important in chemistry and chemical engineering, so studying classification methods can both advance data mining techniques and greatly expand the application of knowledge discovery in these fields.
Research on classification knowledge discovery has made significant progress and a variety of data mining methods are in use, but many prominent problems remain to be studied. In chemistry and chemical engineering in particular, the collected data are usually multi-factor, nonlinear, noisy, and non-uniformly distributed, so conventional analysis and processing methods are time-consuming and have difficulty mining the knowledge hidden in the data. Improving the relevant classification knowledge discovery methods would promote the development of chemistry and chemical engineering and would also have considerable economic value.
Granular computing (GrC) is a new concept and computing paradigm for information processing that covers all theories, methods, techniques, and tools related to granularity. Its basic idea is to imitate a characteristic of human intelligence: when solving a complex problem, selecting an appropriate granularity reduces the complexity of the problem and helps find a better solution. Granular computing therefore provides a new approach to knowledge discovery research. Current research on granular computing, however, is mainly theoretical; applications are rarely studied or reported, especially in chemistry and chemical engineering. This thesis summarizes four basic principles for applying granular computing to knowledge discovery, uses them to study several challenges in classification knowledge discovery, and proposes strategies and methods for related problems in chemistry and chemical engineering. The main work and results are as follows:
1. Granulation and clustering summarize knowledge, and the resulting clusters represent the class knowledge implicit in the data. Cluster analysis, one of the important basic methods in soft science research, is an effective tool. The Adaptive Resonance Theory 2 (ART2) network has many advantages for clustering, but it is insensitive to gradually changing input patterns and has limited noise tolerance. An improved ART2 with Enhanced Triplex Matching mechanism (ETM-ART2) is therefore proposed to improve the clustering capability of ART2 networks. Experiments on clustering the olive oil data sets show that ETM-ART2 clusters well and is particularly suited to clustering massive data sets. ETM-ART2 can also construct information granules for classification, which aids knowledge discovery and improves classification performance.
2. Constructing information granules is one of the basic steps in granular computing. Based on the principles of granular knowledge discovery and granular approximate solution summarized in this thesis, a method of constructing information granules with ART networks is proposed, which builds suitable granules for the data under study conveniently and rapidly. A classification knowledge discovery scheme based on information granules is then proposed according to the GrC principle of problem simplification. Two algorithms are developed: the Information Granulation based Fuzzy Classification Knowledge Discovery Method (IG-FCKDM), and KFAG-C4.5, which applies Key Feature Analysis based on Granulation (KFAG) and mines classification rules with C4.5. IG-FCKDM targets imbalanced two-class problems and error-sensitive problems. It constructs information granules with Fuzzy ART and extracts classification rules through fuzzy processing. Experiments on a disease diagnosis problem show that IG-FCKDM performs well on this kind of problem and that its prediction accuracy and credibility are of greater significance to users. KFAG-C4.5, which can be used for general and imbalanced multi-class classification problems, constructs information granules with ETM-ART2, analyzes key features based on granulation, and divides each attribute into a limited number of sub-attributes with strong discriminating power. The information granules are then described by sub-attributes with discrete values of 0 or 1, so that classification rules can be mined with C4.5. Experiments on the two-class glass problem and on an imbalanced multi-class problem show the good classification capability of KFAG-C4.5. The knowledge mined by IG-FCKDM and KFAG-C4.5 differs in form, but in both cases it is concise, comprehensible, and easy for various users to analyze, and both algorithms handle imbalanced data classification effectively.
3. Ensemble learning can usually improve on the performance of a single classifier, and selective ensemble learning has become a research focus. At present, selective ensemble algorithms based on stochastic optimization mainly take generalization error as the objective and largely ignore the properties of individual classifiers, especially diversity measures. Although these methods achieve some good results, they are computationally complex and inefficient. To address the problem of measuring the diversity of individual classifiers, the selective ensemble problem is transformed into a simpler associated space based on the GrC principle of problem equivalence, and a simple, efficient selection mechanism is proposed. The resulting Correctness and Diversity based Selective Ensemble (CDSE) algorithm is based on knowledge granules and balances the accuracy and the diversity of individual classifiers. Experiments on toxicity classification show that CDSE performs better than the ensemble algorithms Bagging and AdaBoost.M1 and than a single C4.5 classifier. By selecting appropriate individual classifiers, CDSE provides an efficient way to improve the generalization performance and efficiency of ensemble classifiers.
4. Addressing both the construction of ensemble classifiers and the prediction decision, a new Correctness and Diversity based Adaptive Selective Ensemble (CDASE) learning algorithm is put forward. It extends CDSE into an adaptive ensemble learning algorithm and further improves generalization performance. An appropriate ensemble classifier is adaptively generated for each category, and together these form a group of ensemble classifiers called AE-Group. Because the ensemble classifiers in the group contain one another, AE-Group uses few computing resources and little storage space. Test data are also classified adaptively, by selecting the most appropriate ensemble classifier from AE-Group. Experiments on several pattern classification problems show that CDASE achieves good ensemble learning results with fewer individual classifiers. Compared with other algorithms, CDASE generalizes well, is more efficient, and is stable. CDASE overcomes the narrow applicability of a single ensemble learner and provides a novel way to further improve the generalization capability of ensemble learning.
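
The abstract states that IG-FCKDM builds its information granules with Fuzzy ART but gives no implementation details. The following is a minimal sketch of granule construction, assuming the standard Fuzzy ART formulation (complement coding, choice function, vigilance test, fast learning). It is not the thesis's IG-FCKDM or ETM-ART2 code; the class name `FuzzyART`, the method `granule_box`, the parameter values, and the toy data are all illustrative assumptions.

```python
from typing import List, Tuple


def complement_code(x: List[float]) -> List[float]:
    """Complement coding [x, 1 - x]; inputs are assumed scaled to [0, 1]."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a: List[float], b: List[float]) -> List[float]:
    """Component-wise minimum (the fuzzy AND operator)."""
    return [min(p, q) for p, q in zip(a, b)]


class FuzzyART:
    """Standard Fuzzy ART; each learned category is treated as one granule."""

    def __init__(self, rho: float = 0.75, alpha: float = 0.001, beta: float = 1.0):
        self.rho = rho      # vigilance: higher values give smaller granules
        self.alpha = alpha  # choice parameter (small positive constant)
        self.beta = beta    # learning rate; 1.0 means fast learning
        self.weights: List[List[float]] = []

    def train_one(self, x: List[float]) -> int:
        """Present one sample and return the index of its granule."""
        i = complement_code(x)
        # Rank existing categories by the choice function |I ^ w| / (alpha + |w|).
        order = sorted(
            range(len(self.weights)),
            key=lambda j: sum(fuzzy_and(i, self.weights[j]))
            / (self.alpha + sum(self.weights[j])),
            reverse=True,
        )
        for j in order:
            w = self.weights[j]
            overlap = fuzzy_and(i, w)
            # Vigilance test: |I ^ w| / |I| must reach rho for resonance.
            if sum(overlap) / sum(i) >= self.rho:
                self.weights[j] = [
                    self.beta * m + (1.0 - self.beta) * old
                    for m, old in zip(overlap, w)
                ]
                return j
        # No category passed the test: commit a new one for this sample.
        self.weights.append(i)
        return len(self.weights) - 1

    def granule_box(self, j: int) -> Tuple[List[float], List[float]]:
        """Return the (lower, upper) corners of the hyperbox encoded by category j."""
        w = self.weights[j]
        d = len(w) // 2
        return w[:d], [1.0 - v for v in w[d:]]


# Toy data (made up): two tight groups in the unit square.
data = [[0.10, 0.10], [0.12, 0.08], [0.90, 0.90], [0.88, 0.92]]
net = FuzzyART(rho=0.75)
labels = [net.train_one(x) for x in data]
print(labels)  # [0, 0, 1, 1]
```

With complement coding and fast learning, each weight vector encodes an axis-aligned box around the samples it has absorbed, which is one plausible reading of an "information granule" here. How the thesis turns such granules into fuzzy classification rules is not described in the abstract and is not modelled in this sketch.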

  • 【Online Publication Contributor】 Zhejiang University
  • 【Online Publication Issue】 2012, Issue 04