基于GEP和RS的大数据集分类模型研究

Research on Large Database Classification Models Based on RS and GEP

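【Author】 Hu Huiying (胡卉颖)

【Advisors】 Zhong Zhi (钟智); Yuan Chang'an (元昌安)

【Author information】 Guangxi Teachers Education University (广西师范学院), Computer Application Technology, 2012, Master's degree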

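【Abstract, translated from the Chinese】 Classification is a form of data analysis that extracts, from large amounts of data, a model describing the objects in it. Because classification applies a model learned from known data to predict new data, it is a supervised learning process. A good classification rule helps us understand a class better and make effective use of the data in it. Classification is among the most important tasks in data mining: it analyzes known data to extract a classification model, then uses that model to map each item to be classified onto the specified classification rules. It has been widely applied to prediction in machine learning, neural networks, performance analysis, and other areas. In practice, the training sets used for classification are mostly continuous, noisy, and incomplete, which tends to reduce classification accuracy. To improve accuracy, this thesis first discretizes continuous data with a critical-value equal-width interval discretization method. It then draws on the knowledge-classification capability of rough set theory, which can handle incomplete, redundant, and missing knowledge, and combines it with the evolutionary strategy of gene expression programming (GEP), focusing on removing redundant and incomplete data at the data preprocessing layer. The result is a rough set attribute reduction algorithm based on gene expression programming (Attribute Reduction of Rough Set Based on Gene Expression Programming, ARRS_GEP). Finally, to address the excessive number of rules produced by current rule-extraction methods, the thesis proposes a new classification model that comprises data preparation, data preprocessing, rule extraction, rule testing, and rule evaluation.
The main work of the thesis is as follows: (1) It systematically reviews classification, gene expression programming, and rough set theory together with the state of research on each, describes in detail attribute reduction, the core problem of rough set theory, points out the shortcomings of reduction based on genetic algorithms, and compares genetic algorithms with gene expression programming to identify the differences between the two evolutionary algorithms. (2) Building on a theoretical analysis of gene expression programming, it studies how to improve attribute reduction and proposes a GEP-based reduction algorithm, ARRS_GEP. Experiments with different reduction methods verify the effectiveness of ARRS_GEP. (3) Many methods used in classification, rough sets among them, require discrete data. For this the thesis discretizes continuous features with the critical-value equal-width interval method. It also analyzes the problem of noisy data in rule extraction and proposes applying the ARRS_GEP reduction algorithm at the preprocessing layer, using crossover, mutation, recombination, and insertion-sequence operations to reduce the condition attributes. A classification algorithm then extracts rules from the reduced data. (4) The proposed classification model is validated on the prediction of failures among listed companies in a given year. The experiments show that the model reduces the complexity of the classification rules: the extracted rules are simple and use few attributes. This indicates that the model is effective for knowledge reduction and rule extraction.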
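
The abstract does not spell out how the critical-value equal-width interval discretization works, so the following is only a minimal sketch of plain equal-width binning, the general technique the name points to. The function names `equal_width_cut_points` and `discretize` and the choice of `n_bins` are illustrative assumptions, not the thesis's implementation.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points that split
    [min(values), max(values)] into intervals of equal width.

    Hypothetical helper: the thesis's own choice of critical values
    is not given in the abstract.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant feature cannot be split; everything falls in one bin.
        return []
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in.

    A value equal to a cut point is assigned to the interval on its right;
    the maximum value stays in the last interval.
    """
    return [bisect_right(cut_points, v) for v in values]


if __name__ == "__main__":
    feature = [0.0, 2.5, 5.0, 7.5, 10.0]
    cuts = equal_width_cut_points(feature, n_bins=4)
    print(cuts)
    print(discretize(feature, cuts))
```

Equal-width binning is simple and needs no class labels, but it is sensitive to outliers, since a single extreme value stretches every interval.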

【Abstract】 Classification, as one form of data analysis, extracts from a large amount of data a model that describes the objects in it. Because it uses a known model to predict new data, classification is a supervised learning process. A good classification rule helps us both to understand a class better and to use its data effectively. Classification is an important task in data mining: it extracts a model by analyzing the known attributes of a training set, and the model is then used to map each item to be classified onto the specified classification rules. Classification has been widely applied to prediction in machine learning, neural networks, and performance analysis. In most cases the training set is continuous, noisy, and incomplete, which degrades classification accuracy. To improve accuracy, the paper first discretizes continuous data with a critical-value equal-width interval discretization method. Second, it combines rough set (RS) theory, which can handle incomplete, redundant, and missing knowledge, with the evolutionary strategy of gene expression programming (GEP), focusing on removing redundant and incomplete data at the data preprocessing layer, and proposes a rough set attribute reduction algorithm based on GEP (ARRS_GEP). Finally, because current methods extract too many classification rules, the paper proposes a new classification model that covers data preparation, data preprocessing, rule extraction, rule testing, and rule evaluation.
The main work of this paper is as follows: (1) We systematically review the literature on classification, GEP, and rough set theory, discuss in detail attribute reduction, the core problem of rough sets, point out the defects of reduction based on genetic algorithms, and identify the differences between genetic algorithms and gene expression programming. (2) On the basis of a theoretical analysis of GEP, the paper studies how to improve attribute reduction and proposes a GEP-based reduction algorithm, ARRS_GEP. Experiments with different reduction methods verify the validity of the new algorithm. (3) Many methods used in classification, rough sets for example, require discrete data. To solve this problem, the paper uses the critical-value equal-width interval method to discretize continuous features. After analyzing the problem of noisy data in rule extraction, the paper proposes applying ARRS_GEP at the data preprocessing layer, with crossover, mutation, recombination, and insertion-sequence operations, to reduce the condition attributes. A classification algorithm then extracts rules from the reduced data. (4) To test and verify the proposed model, the paper applies it to predicting the failure of listed companies in a given year. The results show that the model reduces the complexity of the classification rules: the rules derived by the proposed method are relatively simple and use fewer attributes. This indicates that the model is effective in knowledge reduction and rule extraction.
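
Rough set attribute reduction looks for a smaller set of condition attributes that classifies the data as well as the full set. A standard way to measure this is the dependency degree, the fraction of objects whose equivalence class under the chosen attributes has a single decision value. The sketch below computes it and adds one plausible fitness score that an evolutionary search such as GEP could assign to a candidate attribute mask. The abstract does not give the ARRS_GEP fitness function or chromosome encoding, so `subset_fitness`, its `size_weight` parameter, and the toy table are assumptions for illustration only.

```python
from collections import defaultdict


def dependency_degree(rows, cond_idx, decision_idx):
    """Rough set dependency degree: |POS| / |U|.

    Rows are grouped by their values on the attributes in cond_idx;
    a group belongs to the positive region when all of its rows share
    one decision value.
    """
    if not rows:
        return 0.0
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        decisions[key].add(row[decision_idx])
        counts[key] += 1
    positive = sum(counts[k] for k, ds in decisions.items() if len(ds) == 1)
    return positive / len(rows)


def subset_fitness(rows, mask, decision_idx, size_weight=0.1):
    """Hypothetical fitness for a 0/1 attribute mask: reward a high
    dependency degree and lightly penalize the share of attributes kept.
    This is an assumed scoring rule, not the one used by ARRS_GEP.
    """
    subset = [i for i, bit in enumerate(mask) if bit]
    gamma = dependency_degree(rows, subset, decision_idx)
    return gamma - size_weight * len(subset) / len(mask)


# Toy decision table: columns 0-2 are condition attributes, column 3 is
# the decision. The decision depends only on columns 0 and 1.
TABLE = [
    (0, 0, 1, "fail"),
    (0, 0, 0, "fail"),
    (0, 1, 1, "ok"),
    (1, 0, 0, "ok"),
    (1, 1, 0, "ok"),
]

if __name__ == "__main__":
    for mask in [(1, 1, 1), (1, 1, 0), (1, 0, 0)]:
        print(mask, subset_fitness(TABLE, mask, decision_idx=3))
```

In this toy table, attributes 0 and 1 together preserve the full dependency degree while attribute 2 adds nothing, so the mask `(1, 1, 0)` scores highest of the three.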
