一种蒙特卡罗贝叶斯分类的改进方法

An Improved Method of Monte Carlo Bayesian Classification

【Author】秦鑫 (Qin Xin)

【Supervisor】朱绍文 (Zhu Shaowen)

【Author Information】华中师范大学 (Central China Normal University), Circuits and Systems, 2004, Master's thesis

【摘要 (Abstract)】 With the development of information technology and the wide application of database technology, the amount of accumulated information keeps growing, and extracting the knowledge we are interested in from this massive volume of information has become a pressing problem for society. Knowledge discovery technology emerged in response to this need and has become one of the more active research topics. Knowledge discovery in databases (KDD) can identify valid, novel, potentially useful, and ultimately understandable information in databases. Data mining is a core step of knowledge discovery; it draws on databases, artificial intelligence, mathematical statistics, visualization, parallel computing, and other fields.

Classification is an important part of data mining. By constructing a classification function or classification model (often called a classifier), it maps the data items in a database to one of a set of given classes, so that the model can be used to predict the classes of objects whose class labels are unknown. Among the many classification methods, Bayesian classification has attracted much attention for its simple structure and good performance. Unlike other classification methods, Bayesian classification rests on a solid foundation of mathematical statistics: it is based on Bayes' theorem for computing posterior probabilities, and in theory it is optimal when its assumptions are satisfied.

Monte Carlo is a method that uses statistical sampling theory to approximately solve mathematical or physical problems. When applied to Bayesian classification, it first obtains, from the known prior probabilities, the conditional probability distribution for each candidate class of an unlabeled object; it then uses a sampler to draw random data satisfying these conditional distributions; finally, by tallying these random draws, it obtains the posterior probability distribution over the candidate classes. Random samples from a particular distribution are easy to obtain by running a suitable Markov chain, so Markov chain Monte Carlo (MCMC) is the most commonly used Monte Carlo Bayesian classification method.

MCMC can reduce the time and space costs of data mining, but for massive datasets it is still computationally impractical. This thesis improves the MCMC algorithm so that it can be used to mine massive datasets. The improved algorithm partitions the dataset and changes the way MCMC scans it, splitting the computation into an inner and an outer loop: the outer loop scans the dataset, and the inner loop scans the sampled values of the distribution function. In addition, the thesis evaluates sampling efficiency and effective sample size, and uses a particle filtering method to further improve the practical ability to mine massive datasets.
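The posterior-sampling step sketched above can be illustrated with a minimal random-walk Metropolis sampler. This is a generic textbook sketch under an assumed toy one-parameter Gaussian model and assumed names, not the classifier developed in the thesis:

```python
# Illustrative only: a generic random-walk Metropolis sampler for a
# one-parameter posterior.  The model (Gaussian likelihood, Gaussian
# prior) and all names here are assumptions for the example.
import math
import random

def log_posterior(theta, data, prior_mu=0.0, prior_sigma=10.0):
    lp = -0.5 * ((theta - prior_mu) / prior_sigma) ** 2   # log prior
    lp += sum(-0.5 * (x - theta) ** 2 for x in data)      # log likelihood, x ~ N(theta, 1)
    return lp

def metropolis(data, n_draws=5000, step=0.5, theta0=0.0):
    theta, draws = theta0, []
    current = log_posterior(theta, data)
    for _ in range(n_draws):
        proposal = theta + random.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # accept with probability min(1, posterior ratio)
        if math.log(random.random()) < candidate - current:
            theta, current = proposal, candidate
        draws.append(theta)
    return draws

data = [random.gauss(2.0, 1.0) for _ in range(100)]
draws = metropolis(data)
print("posterior mean estimate:", sum(draws[1000:]) / len(draws[1000:]))
```

Run long enough, the retained draws behave like samples from the posterior distribution, which is the property a Monte Carlo Bayesian classifier relies on when it tallies draws to estimate posterior class probabilities.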

【Abstract】 With the development of information technology and the wide use of databases, more and more information is accumulated, and how to find interesting knowledge in it is a serious problem for our society. Knowledge discovery technology has emerged as the times require and has become one of the hot research topics. KDD (knowledge discovery in databases) can find valid, novel, potentially useful, and ultimately understandable information. Data mining is the key step of KDD, and it involves databases, artificial intelligence, statistics, and related fields.

Classification is an important part of data mining. It assigns the data items in a database to a particular class by constructing a classification function or model (also called a classifier), so the model can be used to predict the classes of unlabelled objects. Unlike other classification methods, Bayesian classification is based on mathematics and statistics; its foundation is Bayes' theorem, which yields the posterior probability. Theoretically speaking, it is the best solution when its assumptions are satisfied.

Monte Carlo is a method that approximately solves mathematical or physical problems by statistical sampling theory. When applied to Bayesian classification, it first obtains the conditional probability distribution of each candidate class of an unlabelled object from the known prior probabilities. Then it uses some kind of sampler to draw, one by one, stochastic data that satisfy these distributions. Finally, it obtains the posterior probability distribution of each candidate class by analysing these stochastic draws. It is easy to obtain stochastic samples that satisfy a particular distribution by running a suitable Markov chain, so MCMC (Markov chain Monte Carlo) is the most common Monte Carlo Bayesian method.

The MCMC method can reduce the costs of time and space in data mining, but it is impracticable for computation on massive datasets. This thesis improves the MCMC method so that it can be adapted to data mining on massive datasets. The proposed approach partitions the dataset and changes the scanning strategy into two loops, an inner loop and an outer loop: the scan of the dataset becomes the outer loop, and the scan of the draws from the posterior distribution becomes the inner loop, as sketched below. Furthermore, this thesis not only evaluates the sampling efficiency and the effective sample size, but also enhances the practical capability of data mining on massive datasets through particle filtering.
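The loop reordering described in the abstract can be sketched as follows. This is a hypothetical illustration of the general idea only (keep a fixed set of draws in memory, scan the data block by block in the outer loop, and revisit every draw in the inner loop); the weighting scheme, block size, and all function names are assumptions, not the thesis's actual algorithm:

```python
# Illustrative only: scan the dataset block by block in the OUTER loop
# and update every stored draw in the INNER loop, so each record is
# read once instead of once per draw.  The weighting scheme and names
# are assumptions for this sketch.
import math
import random

def log_lik(theta, x):
    return -0.5 * (x - theta) ** 2      # stand-in likelihood: x ~ N(theta, 1)

def stream_blocks(dataset, block_size):
    for i in range(0, len(dataset), block_size):
        yield dataset[i:i + block_size]

def weighted_posterior_mean(draws, dataset, block_size=1000):
    log_w = [0.0] * len(draws)                            # one running log-weight per draw
    for block in stream_blocks(dataset, block_size):      # outer loop: the data
        for j, theta in enumerate(draws):                 # inner loop: the draws
            log_w[j] += sum(log_lik(theta, x) for x in block)
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]                # normalise in log space
    return sum(wi * th for wi, th in zip(w, draws)) / sum(w)

draws = [random.gauss(0.0, 3.0) for _ in range(200)]      # e.g. draws from the prior
data = [random.gauss(2.0, 1.0) for _ in range(10000)]
print("estimate:", weighted_posterior_mean(draws, data))
```

The point of the reordering, in the spirit of the abstract, is that the large dataset is the quantity scanned in the outer loop, while the comparatively small collection of draws is the quantity revisited in the inner loop.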

  • 【CLC Number】TP311.13
  • 【Cited by】6
  • 【Downloads】620