Research and Application of Classification Algorithms in Data Mining (数据挖掘分类算法的研究与应用)

【Author】 Liu Zhenyan (刘振岩)

【Advisor】 Wang Wansen (王万森)

【Author information】 Capital Normal University (首都师范大学), Computer Application Technology, 2003, Master's degree

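【Abstract (translated from Chinese)】 With the mature application of database technology and the rapid development of the Internet, the volume of data accumulated by humankind is growing exponentially. People are no longer satisfied with traditional query and statistical-analysis tools for these data; they need to discover deeper regularities that give more effective support to decision-making and scientific research. To meet this need, that is, to extract the useful information hidden in large volumes of data, data mining technology, which applies machine learning to large databases, has developed considerably.

Data mining (DM), also called knowledge discovery in databases (KDD), is the process of extracting implicit, previously unknown, yet potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. The discovered knowledge can be used for information management, query optimization, decision support, process control, and so on, and also for maintaining the data itself. Data mining is therefore a new area of database research with considerable practical value, and it is also a broad interdisciplinary field that combines theory and techniques from databases, artificial intelligence, machine learning, statistics, and other fields.

Classification is a very important task in data mining and is currently the one most widely applied in business. Its goal is to learn a classification function or classification model that maps data items in a database to one of a given set of classes. Many classification methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology. This thesis focuses on classification algorithms in data mining and divides them into eager classification and lazy classification; essentially all of the research is organized around this division.

Main research content of this thesis:

1. The basic techniques of classification in data mining are discussed, including the data classification process, the data preprocessing techniques that classification requires, and the criteria for comparing and evaluating classification methods. Several typical classification algorithms are compared, including decision trees, k-nearest-neighbor classification, and neural network algorithms. The research focus of the thesis is then introduced: classification algorithms are divided into eager classification and lazy classification, and the study of data mining classification algorithms proceeds from this division.

2. Building on a study of decision tree methods, a "lazy decision tree algorithm" based on the idea of "lazy model-based classification" is studied and implemented. The study of decision tree methods sets out the basic concepts of decision trees, their advantages and disadvantages, and the state of their application, and analyzes the priorities for further research on decision tree algorithms. To better meet the needs of applications in network environments, a "lazy decision tree algorithm" for a network environment using the B/S (browser/server) model was implemented by combining traditional decision tree methods with the idea of "lazy model-based classification". Practice shows that using this algorithm in Web applications achieved very good results.

3. Neural network classification is chosen as the representative of eager classification algorithms and studied in depth. Within neural networks, the analysis concentrates on the basic perceptron model, including its construction and learning algorithm, its geometric meaning, and its limitations. The perceptron learning algorithm can classify data only when they are linearly separable; to address this inherent limitation, the perceptron model is studied further and generalized.

4. One class of generalized perceptron model, the algebraic hypersurface neural network model, is studied in depth. This part first introduces the construction of the algebraic hypersurface neural network model and its geometric meaning. It then describes in detail the implementation of the model's learning algorithm, together with the experimental results and the novel aspects of the algorithm. Finally, it proposes goals for further research. The algebraic hypersurface neural network model has great potential for solving nonlinear problems and has particular advantages for classifying high-dimensional nonlinear data. The innovation of this research is the algorithm's adaptive degree-raising computation: the study shows that adopting the adaptive modeling approach greatly increased the modeling success rate. Classification of high-dimensional data, however, runs into memory limits and needs further in-depth study.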
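The eager/lazy division that organizes the thesis can be illustrated with a minimal sketch. It is not taken from the thesis: `LazyKNN` and `EagerCentroid` are hypothetical names, and a nearest-centroid model is swapped in here as a simple stand-in for an eager learner (the thesis itself uses neural networks for that role). The point is only where the work happens: the lazy classifier stores the training data and defers all computation to query time, while the eager one builds its model during training.

```python
import math
from collections import Counter


class LazyKNN:
    """Lazy learner: fit() only stores the data; all work happens in predict()."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Rank every stored example by distance to the query, then vote among the k nearest.
        ranked = sorted(zip(self.X, self.y), key=lambda pair: math.dist(pair[0], x))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


class EagerCentroid:
    """Eager learner (illustrative stand-in): fit() builds a model, one mean point per class."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for point, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(point))
            for i, value in enumerate(point):
                acc[i] += value
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, x):
        # Only the small per-class model is consulted; the training data is no longer needed.
        return min(self.centroids, key=lambda label: math.dist(self.centroids[label], x))


# Made-up toy data for illustration only.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
print(LazyKNN(k=3).fit(X, y).predict((2, 2)))
print(EagerCentroid().fit(X, y).predict((7, 7)))
```

The trade-off shows up in cost: the lazy classifier scans all stored examples on every query, whereas the eager classifier pays once at training time and answers queries from its compact model.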

【Abstract】 With the widespread application of database technology and the rapid development of the Internet, accumulated data are increasing exponentially. People are no longer satisfied with traditional query and statistical methods for these data; they want to find deeper regularities that provide effective decision support for decision-making and scientific research. Data mining technology, which applies machine learning to large databases in order to extract useful information from large volumes of data, has developed to meet this need.

Data mining (DM), or knowledge discovery in databases (KDD), is the process of discovering useful information and potential knowledge that is hidden in, and not previously known from, large amounts of incomplete, noisy, fuzzy, and random data. The discovered knowledge can be used for information management, query optimization, decision-making, process control, database maintenance, and so on. Data mining is therefore a valuable new area of database research, and an interdisciplinary subject that draws on the theory and technology of databases, artificial intelligence, machine learning, statistics, and other fields.

Classification is a very important task in data mining and is widely applied in commerce at present. The goal of classification is to learn a classification function or classification model that maps a data item to one of a set of preassigned classes. Researchers in machine learning, expert systems, and neurobiology have proposed many classification methods. This thesis studies classification algorithms in data mining. It divides classification algorithms into eager and lazy algorithms, and all of the research is based on this division.

The main work of the thesis:

1. The basic techniques of classification in data mining are introduced. These include the classification procedure, the preprocessing of data for classification, and the criteria for comparing and evaluating classification methods. Several typical classification algorithms are compared: decision trees, k-nearest neighbor, and neural network algorithms. The emphasis of the thesis is then introduced: classification algorithms are divided into eager and lazy, and the study of classification algorithms in data mining is based on this division.

2. A lazy decision-tree algorithm, derived from the idea of lazy model-based classification, is studied on the basis of research into the traditional decision tree. For the traditional decision tree, the concepts, advantages, and disadvantages are presented, and its application and research status are analyzed. For the Web environment, a Web application is developed that uses the lazy decision-tree algorithm derived from the idea of lazy model-based classification. Practical use shows that this method achieved good results.

3. The neural network is studied in depth as the representative of eager classification, and the perceptron is selected for study. First, the construction of the typical perceptron model and its learning algorithm are introduced. Then, on the basis of the principle and geometric interpretation of the typical perceptron model, its limitations are studied. The limitation is that the perceptron learning algorithm can be used only when the data are linearly separable. To resolve this problem, extended perceptron models are studied.

4. The algebraic hypersurface neural network is one kind of extended perceptron model, and it is a focus of this thesis. First, the construction of this model and its geometric interpretation are introduced. Then its learning algorithm is implemented, and the test results and the innovation of the program are presented. Finally, aims for further work are proposed on the basis of the test conclusions. This model has the potential to solve problems that are not linearly separable, and it is particularly suited to classifying high-dimensional data. The adaptive degree-raising computation is the innovation of this research. The research shows that the success rate of model building rises when the adaptive method is used. For high-dimensional data, however, the method is limited by memory, so further in-depth research will continue.
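The linear-separability limitation and the idea of raising the degree can be made concrete with a small sketch. The abstract does not describe the thesis's actual learning algorithm, so everything below is an assumed reading rather than the author's method: the perceptron rule is run on polynomial (monomial) features of the inputs, and the degree is raised one step at a time until training converges. The names `poly_features`, `train_perceptron`, `fit_adaptive`, and `predict`, and the limits `max_epochs` and `max_degree`, are hypothetical.

```python
from itertools import combinations_with_replacement


def poly_features(x, degree):
    """All monomials of the inputs up to the given total degree, including the constant 1."""
    feats = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in combo:
                term *= x[i]
            feats.append(term)
    return feats


def train_perceptron(rows, labels, max_epochs=1000):
    """Classic perceptron rule on labels in {-1, +1}.

    Returns the weights once an epoch makes no mistakes, or None if the
    epoch limit (an assumed cutoff) is reached first.
    """
    w = [0.0] * len(rows[0])
    for _ in range(max_epochs):
        errors = 0
        for phi, y in zip(rows, labels):
            score = sum(wi * pi for wi, pi in zip(w, phi))
            if y * score <= 0:  # misclassified or on the boundary
                w = [wi + y * pi for wi, pi in zip(w, phi)]
                errors += 1
        if errors == 0:
            return w
    return None


def fit_adaptive(points, labels, max_degree=4):
    """Raise the polynomial degree until the perceptron separates the data."""
    for degree in range(1, max_degree + 1):
        rows = [poly_features(p, degree) for p in points]
        w = train_perceptron(rows, labels)
        if w is not None:
            return degree, w
    return None, None


def predict(w, degree, x):
    score = sum(wi * pi for wi, pi in zip(w, poly_features(x, degree)))
    return 1 if score > 0 else -1


# XOR: the standard example of data that no straight line can separate.
xor_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]
degree, weights = fit_adaptive(xor_points, xor_labels)
print("degree selected:", degree)
```

On the XOR data a degree-1 (linear) perceptron never reaches an error-free epoch, while the degree-2 feature set contains the product term needed to separate the classes, so the loop stops at degree 2. The number of monomials grows quickly with both degree and input dimension, which is consistent with the memory limitation the abstract reports for high-dimensional data.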

  • 【Classification number】TP311.13
  • 【Citation count】13
  • 【Download count】1668