
Research on Several Problems in Bayesian Network Based Data Mining

Research on Bayesian Network Based Approach for Data Mining

【Author】 关菁华 (Guan Jinghua)

【Supervisor】 刘大有 (Liu Dayou)

【Author Information】 Jilin University, Computer Application Technology, 2009, Ph.D.

【Abstract】 Bayesian networks have attracted wide attention from researchers and practitioners because of their rich probabilistic expressiveness, flexible inference, ability to incorporate prior knowledge, and solid theoretical foundation, and a large number of data mining methods based on Bayesian network models have emerged. Against the background of several open problems in Bayesian network based data mining, this thesis studies classification, clustering, and the concept-drift problem in incremental learning, focusing on Bayesian network based solutions. The main contributions are: 1) Building on a comparison and analysis of four classical Bayesian network classifiers, an instance-selection-based ensemble algorithm for Bayesian network classifiers is proposed. For each test instance, its k nearest neighbors in the training set are selected as a validation set; the weight of each classifier is determined by its accuracy on this validation set, and the results are combined by weighted voting to improve classification accuracy. 2) For the concept-drift problem in incremental learning, an adaptive ensemble learning algorithm AMCE is proposed, in which the weight of each classifier can be adjusted independently to strengthen adaptability, and a pruning strategy removes redundant base classifiers to improve the generalization of the ensemble. An orientation-based selective ensemble algorithm OSEN is proposed to reduce the number of base classifiers participating in the ensemble while improving generalization, and a genetic algorithm is employed to select a subset of the current base classifiers for the ensemble, reducing its generalization error. 3) For naive Bayes clustering, a hybrid clustering algorithm HDPSO based on discrete particle swarm optimization is proposed. The algorithm has good global search ability and incorporates the EM algorithm to locally optimize individual particles, speeding up convergence. Extensive experiments verify the effectiveness and practical value of the proposed algorithms. This work is expected to help advance the domestic data mining field; in particular, the research on classifying concept-drifting data is distinctive and of considerable theoretical and practical value.
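The instance-selection weighted voting described in contribution 1) can be sketched as follows. This is a minimal illustration only: the classifier interface (`predict` on a single instance), the Euclidean distance metric, and the `ConstantClassifier` stand-in are assumptions, not the thesis's actual implementation.

```python
import math
from collections import Counter

class ConstantClassifier:
    """Toy stand-in for a trained NB/TAN/BAN model (hypothetical interface)."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

def instance_selection_ensemble(classifiers, X_train, y_train, x_test, k=5):
    """Classify one test instance: weight each classifier by its accuracy
    on the k nearest training neighbours of the instance, then take a
    weighted vote over the classifiers' predictions."""
    # k nearest neighbours of the test instance in the training set
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x_test))
    nn = order[:k]

    # weight = accuracy of each classifier on this local validation set
    weights = []
    for clf in classifiers:
        correct = sum(1 for i in nn if clf.predict(X_train[i]) == y_train[i])
        weights.append(correct / len(nn))
    if sum(weights) == 0:          # degenerate case: fall back to plain voting
        weights = [1.0] * len(classifiers)

    # weighted vote
    votes = Counter()
    for w, clf in zip(weights, classifiers):
        votes[clf.predict(x_test)] += w
    return votes.most_common(1)[0][0]
```

Because the validation set is recomputed per test instance, each classifier's weight reflects its competence in the local region of that instance rather than its global accuracy.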

【Abstract】 With the development of computer technology and the spread of computer networks, the volume of data generated daily is increasing at an unprecedented speed. How to manage and use these data, and how to transform them into information and knowledge useful for decision-making, has become an important problem. Data mining arose in response to this demand. In short, data mining is the process of searching for useful hidden knowledge stored in large databases. It uses statistics and artificial intelligence techniques to process and analyze data, detect hidden knowledge, and construct models suited to the practical problem, thereby providing a basis for decision-making.

The Bayesian network, as a key technology for dealing with uncertainty, has attracted close attention from more and more researchers and application developers for the following reasons: first, it has rich power to express probabilistic knowledge; second, it can synthesize prior knowledge and supports incremental learning; third, it has a solid mathematical foundation. A large number of Bayesian network based data mining methods have therefore emerged. Although Bayesian networks perform well in model diagnosis and other areas, some issues remain to be resolved: for example, learning an unconstrained Bayesian network is NP-hard, and handling data streams that exhibit concept drift is still an open problem.

The main results and contributions of this thesis include classification and clustering algorithms based on Bayesian networks, as well as dynamic integration approaches for ensembles that handle concept drift. The main achievements are as follows:

(1) The task of classification is to assign one of a set of predefined classes to instances described by attribute (feature) values. Classification analysis is a very active research topic in machine learning and data mining.
Classification can be applied to a wide variety of domains: pattern recognition, forecasting, medical diagnosis, fault diagnosis, loan applications, user modeling, biomedical informatics, and more. A great many techniques have been developed, based on logical methods (decision trees, rule-based techniques), perceptrons, statistics (Bayesian networks, SVMs), and instances (lazy learning). In recent years many researchers have turned to Bayesian networks, which are efficient for classification. Naive Bayes is a simple Bayesian network with a strong assumption of independence among the attribute nodes given the class node. This thesis empirically evaluates algorithms for learning four types of Bayesian network classifiers: NB, TAN, BAN and GBN, where the latter two are constructed using the BN structure learning algorithm DSBNL. These learners were implemented to test their performance on classification problems. The thesis also examines factors that influence classifier accuracy, such as the threshold ε and the training sample size. The experiments show that the larger the data sets, the fewer the prediction errors, and that for some data sets the algorithm cannot reach the best accuracy with a fixed threshold. The analysis of the experimental results suggests two ways to improve the predictive performance of BN classifiers. First, threshold selection based on prediction accuracy is used to avoid overfitting and underfitting when constructing classification models. Second, an instance-selection-based weighted integration algorithm combining NB, TAN and BAN is proposed, and it is demonstrated empirically that this integration works as expected. Collectively, these results argue that BN classifiers deserve more attention in the machine learning and data mining communities.

(2) More and more data in the real world are dynamic rather than static.
The patterns uncovered in the data may change as time goes by, so incremental learning ability becomes important for data mining algorithms. Concept drift is a common problem in incremental learning. To adapt to changes of concept, we need to improve the adaptability of the system, but the stability of the algorithm then declines; conversely, improving stability lets the system make better use of historical information and knowledge, but adaptability suffers. Consequently, it is hard to learn correctly when concept drift occurs. In the real world, concepts are often not stable but change with time, which is known as the problem of concept drift. Concept drift complicates the task of learning and requires solutions different from commonly used batch learning techniques. We consider synthetically generated concept drifts and an example of concept drift from the area of antibiotic resistance. Among the most popular and effective approaches to handling concept drift is ensemble learning, where a set of concept descriptions built on data blocks corresponding to different time intervals is maintained, and the final prediction is the aggregated prediction of the ensemble members. At present, the problem of concept drift has received considerable attention. There are two approaches to handling it: the instance-based approach and the ensemble learning approach, and the literature shows that ensemble learning is the more promising. In this thesis, we propose an ensemble learning algorithm AMCE (Adaptive Multiple Classifiers Ensemble) that adaptively adjusts the weights of the classifiers. To improve adaptability, the algorithm assigns a separate weight-adjustment parameter to each classifier and dynamically adjusts these parameters during online learning to maximize performance.
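The abstract states only that AMCE keeps a separate weight-adjustment parameter per classifier; the concrete update rule below (multiplicative decay on error, then normalization) is an illustrative assumption, not the thesis's actual formula:

```python
def update_weights(weights, rates, predictions, y_true):
    """One online step of adaptively re-weighting ensemble members.
    Each classifier i carries its own adjustment rate rates[i]; a classifier
    that misclassifies the current instance has its weight decayed by its
    own rate, after which all weights are renormalized to sum to 1."""
    new = [w if p == y_true else w * (1.0 - r)
           for w, r, p in zip(weights, rates, predictions)]
    total = sum(new)
    return [w / total for w in new]
```

Because each classifier has its own rate, a member that is accurate on the current concept can be penalized slowly while a stale member is discounted quickly, which is the adaptive behaviour the algorithm aims for.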
We also use a pruning method based on KL divergence to eliminate redundant classifiers. Experimental results show that the proposed algorithm improves predictive accuracy and speed, reduces memory usage, and is effective in tracking concept drift under noise. In data streams, concepts are often not stable but change with time. This thesis proposes a selective integration algorithm OSEN (Orientation based Selected ENsemble) for handling concept-drifting data streams. The algorithm selects a near-optimal subset of base classifiers based on the output of each base classifier on a validation dataset. Experiments with synthetic data sets simulating abrupt (SEA) and gradual (Hyperplane) concept drifts demonstrate that selective integration of classifiers built over small time intervals or fixed-size data blocks can be significantly better than majority voting and weighted voting, currently the most commonly used integration techniques for handling concept drift with ensembles. The thesis also explains the working mechanism of OSEN through the error-ambiguity decomposition: based on the experiments, OSEN improves generalization by reducing the average generalization error of the base classifiers constituting the ensemble. The thesis further proposes a selective integration algorithm DGASEN (Dynamic GASEN) for handling concept-drifting data streams, which likewise selects a near-optimal subset of base classifiers based on their outputs on a validation dataset; experiments show that DGASEN improves generalization.

(3) Clustering is the unsupervised classification of patterns into clusters: unlike classification (supervised learning), no a priori labeling of some patterns is available for categorizing others and inferring the cluster structure of the whole data.
The clustering problem has been addressed in many contexts and by researchers in many disciplines, which reflects its broad appeal and its usefulness as one of the steps in data analysis. Many algorithms have been proposed, such as model-based, distance-based, density-based and deviation-based algorithms. This thesis concentrates on model-based algorithms, where the most frequently used approaches are mixture models and Bayesian networks. The thesis describes how to apply the DPSO method effectively to parameter estimation for NB clustering. Experimental results show that DPSO converges much more slowly than the EM algorithm but finds better global solutions, while EM is sensitive to the initial solution and easily gets trapped in local optima. Because DPSO and EM have complementary drawbacks, the EM algorithm is introduced into traditional DPSO as a local search to improve efficiency: EM lets every particle find a local optimum in its current region of the search space, and this local search improves the performance of the swarm. The research results of this thesis will enrich and advance the study of Bayesian network based clustering and classification algorithms, as well as algorithms for the concept-drift problem, in both theoretical and technological aspects.
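The EM local search used inside HDPSO can be illustrated on a naive Bayes mixture over binary attributes. The sketch below shows a single EM iteration only; the thesis's exact parameterization, the discrete PSO outer loop, and the function name are illustrative assumptions:

```python
import math

def nb_em_step(X, pi, theta):
    """One EM iteration for a K-cluster naive Bayes mixture over d binary
    attributes. pi[c] is the prior of cluster c; theta[c][j] = P(x_j=1 | c).
    Returns updated (pi, theta); Laplace smoothing in the M-step keeps the
    Bernoulli parameters away from 0 and 1."""
    n, d, K = len(X), len(X[0]), len(pi)

    # E-step: responsibility of each cluster for each instance,
    # computed in log space for numerical stability
    resp = []
    for x in X:
        logp = []
        for c in range(K):
            lp = math.log(pi[c])
            for j in range(d):
                p = theta[c][j]
                lp += math.log(p if x[j] == 1 else 1.0 - p)
            logp.append(lp)
        m = max(logp)
        ws = [math.exp(lp - m) for lp in logp]
        s = sum(ws)
        resp.append([w / s for w in ws])

    # M-step: re-estimate cluster priors and per-attribute parameters
    pi_new = [sum(r[c] for r in resp) / n for c in range(K)]
    theta_new = [[(1.0 + sum(resp[i][c] * X[i][j] for i in range(n)))
                  / (2.0 + sum(resp[i][c] for i in range(n)))
                  for j in range(d)] for c in range(K)]
    return pi_new, theta_new
```

In the hybrid scheme described above, each DPSO particle encodes a candidate parameter assignment, and a few such EM steps are applied to each particle so that the swarm's global search is combined with fast local convergence.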

  • 【Online Publisher】 Jilin University
  • 【Online Publication Year/Issue】 2010, Issue 07