朴素贝叶斯分类器及其改进算法研究

Research on Naive Bayes Classifiers and Their Improved Algorithms

【Author】 蒋良孝 (Liangxiao Jiang)

【Supervisors】 王典洪 (Dianhong Wang); 蔡之华 (Zhihua Cai)

【Author Information】 China University of Geosciences, Earth Exploration and Information Technology, 2009, PhD

【Abstract】 Classification is one of the most important tasks in data mining and is widely used in real-world applications; for example, it can be used to judge whether an e-mail is junk mail according to its title and content. There exist many methods for building classifiers, such as Bayesian networks, decision trees, instance-based learning, artificial neural networks, support vector machines, genetic algorithms, rough sets, and fuzzy sets. Among these methods, Bayesian networks are among the most popular, thanks to their distinctive representation of uncertain knowledge, their rich ability to express probabilities, and their incremental learning that incorporates prior knowledge.

Because learning an optimal Bayesian classifier is, like learning a Bayesian network, an NP-hard problem, learning naive Bayes classifiers has attracted much attention from researchers. The naive Bayes classifier is based on a simple but unrealistic assumption: that the attribute values are conditionally independent given the class label. Recent work in supervised learning has nevertheless shown that this surprisingly simple Bayesian classifier, with its strong independence assumption among attributes, is competitive with state-of-the-art classifiers such as C4.5 and the k-nearest neighbor algorithm.
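To make the conditional independence assumption concrete, here is a minimal sketch (our illustration, not code from the thesis) of a naive Bayes classifier for discrete attributes with Laplace smoothing; all names are ours:

```python
import numpy as np
from collections import defaultdict

class DiscreteNB:
    """Minimal naive Bayes for discrete attributes with Laplace smoothing."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.class_counts_ = defaultdict(int)   # N(c)
        self.counts_ = defaultdict(int)         # N(attribute j = v, class c)
        self.values_ = defaultdict(set)         # observed values of attribute j
        for xi, c in zip(X, y):
            self.class_counts_[c] += 1
            for j, v in enumerate(xi):
                self.counts_[(j, v, c)] += 1
                self.values_[j].add(v)
        self.n_ = len(y)
        return self

    def predict(self, xi):
        # argmax_c  log P(c) + sum_j log P(x_j | c): the independence assumption
        # lets the joint likelihood factor into one term per attribute.
        best, best_score = None, -np.inf
        for c in self.classes_:
            prior = (self.class_counts_[c] + 1.0) / (self.n_ + len(self.classes_))
            score = np.log(prior)
            for j, v in enumerate(xi):
                num = self.counts_[(j, v, c)] + 1.0               # Laplace smoothing
                den = self.class_counts_[c] + len(self.values_[j])
                score += np.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

# Usage: DiscreteNB().fit(X_train, y_train).predict(x_test)
```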
A natural question is whether relaxing the attribute conditional independence assumption of naive Bayes can improve its classification performance. To answer this question, researchers have proposed many improved methods, which fall broadly into three categories: 1) structure augmentation, which uses directed arcs to represent the dependencies among attributes; 2) attribute selection, which selects an attribute subset from the whole space of attributes; 3) local learning, which builds a local naive Bayes classifier for each test instance.

This thesis takes naive Bayes classifiers as its basic object and studies the various methods for improving them, presenting three classifiers: a hidden augmented naive Bayes classifier, an evolutionary selective naive Bayes classifier, and a dynamic local naive Bayes classifier. In many real-world data mining applications, a ranking of test instances is also very important; the thesis therefore surveys the ranking performance of naive Bayes classifiers and presents a locally cloned naive Bayes ranking algorithm. In addition, the thesis studies some other methods for improving naive Bayes classifiers (attribute weighting, instance weighting, and combined learning) and presents an instance weighted naive Bayes classification algorithm based on similarity and a combined classification algorithm based on C4.5 and NB. Finally, the thesis explores the application value of the new algorithms in several real-world problems.

The main contributions of this thesis are:

1) It gives the algorithmic framework for learning augmented naive Bayes classifiers, reviews the structure augmentation methods for improving naive Bayes classifiers, and presents a hidden augmented naive Bayes classifier (HANB). HANB creates a hidden parent for each attribute node that combines the influences of all the other attribute nodes as a weighted average, where the weights are the conditional mutual information between each pair of attributes (see the first sketch after this list).

2) It gives the algorithmic framework for learning selective naive Bayes classifiers, reviews the attribute selection methods for improving naive Bayes classifiers, and presents an evolutionary selective naive Bayes classifier (ESNB). ESNB first uses an evolutionary algorithm to select an attribute subset from the whole space of attributes and then builds a naive Bayes classifier on the selected attributes. In ESNB, the fitness function is the classification accuracy of the current naive Bayes classifier, and binary encoding is used: the length of each binary string is the number of original attributes, a bit of "1" or "0" indicates that the corresponding attribute is selected or not, respectively, and the search stops after a fixed number of generations (see the second sketch after this list).

3) It gives the algorithmic framework for learning local naive Bayes classifiers, reviews the local learning methods for improving naive Bayes classifiers, and presents a dynamic local naive Bayes classifier (DLNB). At training time, DLNB uses a brute-force search based on leave-one-out cross-validation to dynamically select the k value that best fits the training data; once the best k is learned, it is used to classify all test instances (see the third sketch after this list).

4) It reviews the state of research on ranking, surveys the ranking performance of naive Bayes classifiers, and presents a locally cloned naive Bayes ranking algorithm (LCNB). LCNB first applies the k-nearest neighbor algorithm to find the k nearest neighbors of a test instance, then clones each neighbor according to its similarity to the test instance, and finally builds a naive Bayes classifier on the training set to which the clones have been added (see the fourth sketch after this list).

5) It gives the algorithmic framework for learning attribute weighted and instance weighted naive Bayes classifiers, reviews four kinds of methods for building combined classifiers, and presents an instance weighted naive Bayes classification algorithm based on similarity (IWNB-S) and a combined classification algorithm based on C4.5 and NB (C4.5-NB).

6) It explores the application value of the new algorithms (HANB, ESNB, DLNB) in several real-world problems.
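The weights HANB uses are the conditional mutual information between attribute pairs given the class. The following is a sketch, under the assumption of discrete attributes, of how I(A_i; A_j | C) can be estimated from data; it illustrates the quantity, not the thesis's implementation:

```python
import numpy as np
from collections import Counter

def conditional_mutual_information(a_i, a_j, c):
    """Estimate I(A_i; A_j | C) from three parallel sequences of labels."""
    n = len(c)
    n_xyz = Counter(zip(a_i, a_j, c))   # joint counts over (A_i, A_j, C)
    n_xz = Counter(zip(a_i, c))
    n_yz = Counter(zip(a_j, c))
    n_z = Counter(c)
    cmi = 0.0
    for (x, y, z), nxyz in n_xyz.items():
        # P(x,y,z) * log[ P(x,y|z) / (P(x|z) P(y|z)) ]; the 1/n factors cancel
        # inside the log, leaving a pure ratio of counts.
        cmi += (nxyz / n) * np.log(nxyz * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return cmi

# HANB-style weight of attribute j in attribute i's hidden parent
# (row-normalised over all j != i):
#   w[i][j] = cmi(i, j) / sum(cmi(i, k) for k != i)
```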
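In the same spirit, here is a toy version of ESNB's search loop: binary chromosomes over attributes, a fitness callback that scores a naive Bayes classifier on the selected subset, and a fixed generation count as the stopping rule. The population size, selection scheme, and crossover and mutation rates are illustrative defaults of this sketch, not the thesis's settings:

```python
import random

def evolve_attribute_mask(n_attrs, fitness, pop_size=20, generations=50,
                          p_cross=0.8, p_mut=0.02, seed=0):
    """Toy ESNB-style search: bit k of a chromosome = 1 keeps attribute k.

    fitness(mask) should train a naive Bayes classifier on the selected
    attributes and return its classification accuracy.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(pop_size)]
    for _ in range(generations):                 # stopping rule: generation count
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]                    # elitism: keep the two best
        while len(next_pop) < pop_size:
            a, b = rng.sample(scored[:pop_size // 2], 2)  # pick from better half
            if rng.random() < p_cross:           # one-point crossover
                cut = rng.randrange(1, n_attrs)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)                 # best attribute subset found
```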

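DLNB's training step can be pictured as follows. Here build_nb stands in for any naive Bayes learner (for instance the DiscreteNB sketch above, after discretization), Euclidean distance is one simple neighbourhood choice, and the candidate grid of k values is an assumption of this sketch, not taken from the thesis:

```python
import numpy as np

def select_best_k(X, y, build_nb, candidate_ks=(1, 5, 10, 20, 30, 40, 50)):
    """Leave-one-out selection of k for a local naive Bayes (DLNB-style sketch)."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(X)
    # All pairwise Euclidean distances; sorting row i ranks i's neighbours.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:]         # column 0 is the instance itself
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        hits = 0
        for i in range(n):                       # leave instance i out
            nbrs = order[i, :k]
            local_nb = build_nb(X[nbrs], y[nbrs])    # NB on i's k nearest neighbours
            hits += (local_nb.predict(X[i]) == y[i])
        acc = hits / n
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k                                # reused for every test instance
```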
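Finally, the cloning step of LCNB, again as a sketch: the similarity function and the rule mapping similarity to a number of copies are our assumptions, since the abstract only specifies that each neighbour is cloned according to its similarity to the test instance:

```python
import numpy as np

def lcnb_training_set(X, y, x_test, k=10):
    """Expand the training set with clones of x_test's k nearest neighbours."""
    X, y = np.asarray(X, float), np.asarray(y)
    d = np.linalg.norm(X - x_test, axis=1)
    nbrs = np.argsort(d)[:k]                  # k nearest neighbours of x_test
    sim = 1.0 / (1.0 + d[nbrs])               # one simple similarity choice
    X_new, y_new = [X], [y]
    for idx, s in zip(nbrs, sim):
        copies = int(round(10 * s))           # more similar -> more clones (assumed rule)
        X_new.append(np.repeat(X[idx][None, :], copies, axis=0))
        y_new.append(np.repeat(y[idx], copies))
    return np.vstack(X_new), np.concatenate(y_new)

# A naive Bayes classifier trained on the returned set then ranks/classifies x_test.
```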