

Studies on Classifiers Based on Decision Boundaries from the Perspective of Dividing Data Space

【Author】 Yan Zhiyong

【Supervisor】 Pan Yunhe

【Author Information】 Zhejiang University, Computer Science and Technology, 2011, PhD


【Abstract】 Classifiers are an important technique in machine learning. Classifier research admits two perspectives: the mapping perspective and the dividing perspective. From the mapping perspective, a classification model can be regarded as a mapping from the data space to the label set, and training a classifier can be regarded as searching the hypothesis space for the optimal hypothesis. From the dividing perspective, a classification model can be regarded as a group of decision boundaries that divide the data space into several decision regions, and training a classifier can be regarded as dividing the data space to obtain those decision boundaries. The mapping perspective is the mainstream, and many studies adopt it; there has been no systematic study of classifiers from the dividing perspective. This dissertation uses decision boundaries as the tool for studying classifiers from the dividing perspective: it constructs a theoretical framework based on decision boundaries from the perspective of dividing the data space, and improves classifiers within this framework. The main contributions are as follows.

1) Formal definitions of decision boundary, decision region and probability gradient region are presented, together with two methods for obtaining decision boundaries: a formal method and a sampling method. The Decision Boundary Point Set (DBPS) algorithm, the Decision Boundary Point Set using Grid for 2-D data (DBPSG-2D) algorithm and the Decision Boundary Neuron Set (DBNS) algorithm are proposed to obtain sampling points near decision boundaries. The Self-Organizing Mapping based Decision Boundary Visualization (SOMDBV) algorithm and the Self-Organizing Mapping based Probability Gradient Regions Visualization (SOMPGRV) algorithm are proposed to visualize decision boundaries and probability gradient regions, respectively.

2) A three-element, nine-factor theoretical framework for classifiers based on decision boundaries is proposed from the perspective of dividing the data space. In this framework, the dividing objective, the decision boundary form and the dividing method are the three elements of a classifier. The dividing objective involves three factors: training accuracy, the characteristics of misclassified instances and the micro-location of decision boundaries. The decision boundary form involves three factors: dividing capability, the domain knowledge provided and comprehensibility. The dividing method involves three factors: the information utilized, the dividing pattern and the complexity of the classification model.

3) A new characteristic of misclassified instances based on the K nearest Neighbors (KN) type is proposed. According to the label relationship between an instance and its K nearest neighbors, instances are divided into three KN types: S-type, DS-type and D-type. On the KN types, the characteristic of misclassified instances of the K Nearest Neighbors (KNN) algorithm differs significantly from those of the C4.5 algorithm, the Naive Bayes classifier and the support vector machine (SVM). The K Nearest Neighbors Combining (KNC) algorithm is proposed to combine the KNN algorithm with the C4.5 algorithm, the Naive Bayes classifier or the SVM: KNC uses the KNN algorithm to predict S-type and DS-type instances, and uses the other classifier to predict D-type instances.

4) The impact of discretization algorithms on classifiers' decision boundaries is studied. It is proposed that discretization algorithms improve the generalization ability of the Naive Bayes classifier because they increase its Vapnik-Chervonenkis (VC) dimension. Discretization algorithms are also applied to the SVM and the KNN algorithm, and their impact on the VC dimensions of these two classifiers is discussed.

5) The Second Division (SD) algorithm, which trains classifiers within the decision regions of the Naive Bayes classifier, is proposed, and existing algorithms for training local classifiers are studied. The SD algorithm is a hybrid of global learning and local learning, and can therefore improve the generalization ability of the Naive Bayes classifier. Existing local classifier training algorithms are divided into three types: test selection, divide-and-conquer and training selection. It is proposed that training local classifiers improves generalization ability because it increases the VC dimension of the classifier and utilizes more information from the training data set.
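The sampling route to a decision boundary described in contribution 1) can be illustrated with a minimal sketch: given a trained classifier and a pair of points that receive different labels, bisecting the segment between them converges to a point on the boundary. This is a generic illustration only; the dissertation's DBPS, DBPSG-2D and DBNS algorithms are not reproduced here, and the `bisect_boundary` function and toy classifier below are assumptions for demonstration.

```python
def bisect_boundary(classify, a, b, iters=30):
    """Approximate a point on the decision boundary between a and b.

    `classify` maps a point (tuple of floats) to a class label;
    a and b must receive different labels.
    """
    la = classify(a)
    assert classify(b) != la, "endpoints must straddle the boundary"
    for _ in range(iters):
        # midpoint of the current segment
        mid = tuple((x + y) / 2 for x, y in zip(a, b))
        # keep the half of the segment that still straddles the boundary
        if classify(mid) == la:
            a = mid
        else:
            b = mid
    return tuple((x + y) / 2 for x, y in zip(a, b))

# toy classifier: sign of x0 + x1, so the true boundary is the line x0 + x1 = 0
f = lambda p: int(p[0] + p[1] > 0)
pt = bisect_boundary(f, (-1.0, -1.0), (2.0, 0.5))
```

Repeating this for many straddling pairs yields a cloud of sampling points near the boundary, which is the kind of point set the DBPS-family algorithms produce.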
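The KNC routing idea in contribution 3) can be sketched as follows: trust KNN where its K nearest neighbors are unanimous, and defer to a second classifier otherwise. The dissertation's exact S-/DS-/D-type criteria are not given in this abstract, so the unanimity test below is an illustrative stand-in for the S/DS versus D split, and a nearest-centroid model stands in for the C4.5 / Naive Bayes / SVM partner; both substitutions are assumptions.

```python
from collections import Counter
from math import dist

def knn_neighbors(train, x, k):
    # train is a list of ((features...), label) pairs
    return sorted(train, key=lambda p: dist(p[0], x))[:k]

def knc_predict(train, x, k, fallback):
    labels = [y for _, y in knn_neighbors(train, x, k)]
    top, votes = Counter(labels).most_common(1)[0]
    if votes == k:        # neighbours unanimous: trust KNN (S/DS-like case)
        return top
    return fallback(x)    # neighbours disagree: defer to the partner (D-like case)

def centroid_fallback(train):
    # stand-in partner classifier: predict the nearest class centroid
    byclass = {}
    for p, y in train:
        byclass.setdefault(y, []).append(p)
    cents = {y: tuple(sum(c) / len(ps) for c in zip(*ps))
             for y, ps in byclass.items()}
    return lambda x: min(cents, key=lambda y: dist(cents[y], x))

train = [((0.0, 0.0), 'a'), ((0.1, 0.0), 'a'),
         ((1.0, 1.0), 'b'), ((0.9, 1.1), 'b')]
pred = knc_predict(train, (0.05, 0.05), k=3, fallback=centroid_fallback(train))
```

With `k=3` the neighbors of the query are not unanimous, so the prediction is routed to the fallback model, mirroring how KNC hands D-type instances to the partner classifier.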
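Contribution 4) concerns discretization as a preprocessing step for Naive Bayes, SVM and KNN. The abstract does not fix a specific discretization algorithm, so the sketch below uses simple equal-width binning as an assumed example of mapping a continuous attribute to a small number of intervals before training.

```python
def equal_width_bins(values, n_bins):
    """Map continuous values to bin indices 0..n_bins-1 (equal-width binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0     # guard against a constant attribute
    # clamp the top value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

bins = equal_width_bins([0.0, 0.2, 0.5, 0.9, 1.0], n_bins=2)
```

Each attribute of the training set would be binned this way before estimating the Naive Bayes conditional probabilities, replacing smooth per-attribute densities with piecewise-constant ones and thereby changing the shape of the resulting decision boundaries.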
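The two-stage structure of the SD algorithm in contribution 5) can be sketched generically: a global model carves the data space into decision regions, then a local model is trained inside each region on the training points that fall there. In the sketch below a nearest-centroid model stands in for the global Naive Bayes classifier and 1-NN for the local learner; both are assumptions, not the dissertation's exact components.

```python
from math import dist

def nearest_centroid_model(train):
    # stand-in global classifier: predict the nearest class centroid
    byclass = {}
    for p, y in train:
        byclass.setdefault(y, []).append(p)
    cents = {y: tuple(sum(c) / len(ps) for c in zip(*ps))
             for y, ps in byclass.items()}
    return lambda x: min(cents, key=lambda y: dist(cents[y], x))

def second_division_fit(train):
    globe = nearest_centroid_model(train)      # first (global) division
    regions = {}                               # region label -> local training set
    for p, y in train:
        regions.setdefault(globe(p), []).append((p, y))

    def predict(x):
        local = regions.get(globe(x))
        if not local:                          # empty region: fall back to global
            return globe(x)
        # second (local) division: 1-NN among the points of this region
        return min(local, key=lambda py: dist(py[0], x))[1]

    return predict

train = [((0.0, 0.0), 'a'), ((0.2, 0.1), 'a'),
         ((1.0, 1.0), 'b'), ((0.8, 0.9), 'b'),
         ((0.45, 0.45), 'b')]
model = second_division_fit(train)
```

The local learner only ever sees the training points of one global decision region, which is the sense in which SD combines global learning (the region assignment) with local learning (the per-region model).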

  • 【Online Publication Contributor】 Zhejiang University
  • 【Online Publication Year/Issue】 2012, Issue 07