Research on an Ensemble KNN Classifier Based on Metric Learning
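【Author】 Yu Fei (于飞);
【Supervisor】 Gu Hong (顾宏);
【Author information】 Dalian University of Technology, Measurement and Testing Instruments, 2009, Master's degree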
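【Abstract (translated from the Chinese)】 In recent years, data mining has attracted great attention in the information industry, mainly because large amounts of widely applicable data exist and there is an urgent need to turn these data into useful information and knowledge. The information and knowledge obtained can be applied in many fields, including business management, production control, market analysis, engineering design, and scientific exploration. This thesis focuses on one branch of data mining, the classification problem. It combines an ensemble algorithm with an improved classification algorithm to design an ensemble KNN classifier based on distance (metric) learning. The classifier first filters the attributes of the data set: it computes the information gain of every attribute in the training set and discards, as irrelevant, the attributes whose information gain is below a given threshold. It then uses bagging as the ensemble method to build the sub-classifiers. On the one hand, the bootstrap method randomly samples the training set to build multiple sub-classifiers; on the other hand, attributes are again randomly removed for each sub-classifier. Perturbing the input attributes in this way not only preserves the accuracy of the sub-classifiers but also increases the diversity among them. Next, each sub-classifier uses a KNN classification algorithm based on distance learning to compute its classification result, with the distance-learning module implemented by Neighbourhood Components Analysis (NCA). Finally, majority voting combines the classification results into the final decision. Experimental data show that the new classifier achieves a substantially higher accuracy than either a plain ensemble KNN classifier or a plain distance-learning KNN classifier.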
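The attribute-filtering step described above can be illustrated with a minimal sketch. It assumes discrete attribute values and uses only the Python standard library; the function names (`entropy`, `information_gain`, `filter_attributes`) and the toy data are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the discrete attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def filter_attributes(rows, labels, threshold):
    """Indices of attributes whose information gain is at least `threshold`."""
    return [a for a in range(len(rows[0]))
            if information_gain(rows, labels, a) >= threshold]


# Toy data (an assumption for illustration): attribute 0 determines the class,
# attribute 1 is unrelated to it.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
kept = filter_attributes(rows, labels, threshold=0.5)
```

Continuous attributes would need to be discretised first (or handled with a different gain estimate); the abstract does not say which approach the thesis takes.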
【Abstract】 In recent years, data mining has attracted a great deal of attention in science and in the information industry. As more data are gathered, with the amount of data doubling every year, data mining is becoming an increasingly important tool for transforming these data into useful information. It is used in a wide range of practices, such as marketing, surveillance, fraud detection, and scientific discovery. This paper focuses on one branch of data mining, classification, and proposes a new ensemble learning algorithm based on a metric-learning KNN classifier. First, in a filtering procedure, an information-gain threshold is used to filter the input attributes: the information gain of every original input attribute is evaluated, and the attributes whose gain is less than the threshold f are deemed irrelevant and removed. Second, in a bagging procedure, the component learners are formed by combining regular bootstrap resampling of the instances in the filtered dataset with a perturbation that randomly selects a subset of its input attributes. In this way a strong ensemble can be generated with both high accuracy and high diversity. Third, in a metric-learning procedure, each component learner classifies with a modified KNN classifier based on Neighbourhood Components Analysis (NCA), which learns the KNN distance metric by optimizing a purpose-built objective. Finally, in a combining procedure, a majority-voting strategy merges the results produced by the component learners. A large empirical study shows that this algorithm performs better than plain metric-learning algorithms and plain ensemble KNN classifiers.
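The bagging, attribute-perturbation, and majority-voting procedures summarized above can be sketched as follows. This is a standard-library illustration under stated assumptions, not the thesis implementation: numeric features are assumed, all names are hypothetical, and the NCA step is not implemented. Plain Euclidean distance stands in for the learned metric, and the optional `learn_metric` argument is a hypothetical hook marking where a learned transformation would plug in.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k, transform=None):
    """Majority vote among the k nearest training points.

    `transform` maps a feature vector into the space where distances are
    measured; None means plain Euclidean distance on the raw features.
    """
    f = transform or (lambda v: v)
    q = f(query)
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f(x), q)), i)
        for i, x in enumerate(train_x)
    )
    votes = Counter(train_y[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]


def build_ensemble(xs, ys, n_members, n_attrs, rng, learn_metric=None):
    """Each member: a bootstrap sample of the rows plus a random attribute subset."""
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample rows with replacement
        attrs = sorted(rng.sample(range(len(xs[0])), n_attrs))  # perturb the inputs
        sub_x = [[xs[i][a] for a in attrs] for i in idx]
        sub_y = [ys[i] for i in idx]
        # Hypothetical hook: a metric learner would return a transform here.
        transform = learn_metric(sub_x, sub_y) if learn_metric else None
        members.append((attrs, sub_x, sub_y, transform))
    return members


def ensemble_predict(members, query, k):
    """Combine the members' predictions by majority vote."""
    votes = Counter(
        knn_predict(sub_x, sub_y, [query[a] for a in attrs], k, transform)
        for attrs, sub_x, sub_y, transform in members
    )
    return votes.most_common(1)[0][0]


# Toy data (an assumption for illustration): two well-separated clusters.
xs = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 0],
      [10, 10, 10], [10, 11, 10], [11, 10, 11], [11, 11, 10]]
ys = ["a"] * 4 + ["b"] * 4
members = build_ensemble(xs, ys, n_members=7, n_attrs=2, rng=random.Random(0))
prediction = ensemble_predict(members, [0.5, 0.5, 0.5], k=3)
```

Each member keeps its own attribute subset so that a query is projected onto the same attributes the member was built from. Libraries such as scikit-learn provide NCA and bagging implementations that could replace these hand-written pieces.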
【Key words】 Data Mining; Ensemble Learning; Component Learner; Metric Learning