贝叶斯网络结构学习算法研究与应用

Algorithm Research and Application in Structure Learning of Bayesian Networks

【Author】 Sun Yan (孙岩)

【Advisor】 Tang Yiyuan (唐一源)

【Author information】 Dalian University of Technology, Computer Application Technology, 2010, PhD

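【Abstract (from the Chinese)】 A Bayesian network (BN) combines probability theory with graph theory and represents a joint probability distribution graphically. With complete semantics and a solid theoretical foundation, it has become an important theoretical model for representing and reasoning about uncertain knowledge. Bayesian networks are widely used in machine learning, medical diagnosis, financial analysis and other fields, and have achieved considerable success. However, constructing a Bayesian network from expert knowledge alone is usually very difficult and sometimes impossible. How to learn Bayesian network structure from data quickly and accurately, and how to apply it in practical domains, therefore has significant theoretical and practical value. Building on a study of algorithms in China and abroad, this dissertation investigates algorithms related to Bayesian networks and the problem of learning from incomplete data, proposes improved algorithms, and applies them to the prediction of risk factors for mild cognitive impairment and cerebrovascular disease. The main work is as follows:

1. The K-Nearest Neighbour (KNN) algorithm is widely used in machine learning and data mining. This dissertation combines Bayesian network structure learning with KNN and proposes a KNN algorithm based on Bayesian network structure learning (BS-KNN). The result of structure learning serves as the similarity measure in the improved KNN: the larger the probability coefficient, the more important the corresponding feature and the greater its influence on the classification result. Experiments show that the complexity of the new algorithm is comparable to that of similar algorithms, and that its accuracy and stability improve when the data set has many attribute features and a large sample size.

2. Incomplete data occur frequently and lower the accuracy of Bayesian network structure learning algorithms. To address this, the dissertation proposes a structure learning algorithm that combines the geometric distribution with KL divergence and can learn the structure of a Bayesian network from incomplete data. The algorithm first uses the geometric distribution to represent the correspondence between nodes, then uses KL divergence to measure the similarity of these correspondences and so determine the values of the incomplete data, and finally performs structure learning on the completed data. The method avoids the exponential complexity of standard Gibbs sampling and the main problems of existing learning methods.

3. Mild cognitive impairment is currently regarded as an intermediate stage in the transition from normal aging to dementia, and research on it is very important for the prevention of, and intervention in, senile dementia. Using memory, attention and demographic data, the dissertation proposes a new structure learning algorithm for incomplete data. Mutual information is first used to obtain the importance of the attribute features and thereby find the sample set most similar to the incomplete record. Newton interpolation is then used to obtain the missing values. Finally, a Bayesian network structure for mild cognitive impairment is learned from the completed data in order to predict the condition, assist in its diagnosis, and identify its main influencing factors and their interactions. This greatly reduces the cost of examinations for patients and improves the objectivity of diagnosis. Clinical experimental results show that the method performs well.

4. Cerebrovascular disease is characterized by high incidence, high disability, high mortality and high recurrence rates, so predicting its risk factors is very important. The dissertation combines information gain with a heuristic search that determines the node order to improve existing Bayesian network structure learning algorithms. The improved algorithm is used to analyze the nonlinear probabilistic dependencies among the risk factors of cerebrovascular disease (age, sex, hypertension, diabetes, heart disease and hyperlipidemia) and to predict the risk of onset, and thus to guide prevention and treatment. Experimental results show that the model can objectively and effectively assist in identifying the risk factors of cerebrovascular disease.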
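Item 1 of the abstract describes BS-KNN only at a high level: features with larger probability coefficients count more in the KNN similarity. The Python sketch below illustrates that general idea as a feature-weighted KNN. It is not the dissertation's algorithm. The function name `weighted_knn_predict`, the toy data and the weight values are hypothetical, and because the abstract does not say how the coefficients are derived from the learned network structure, the weights are simply passed in.

```python
import math
from collections import Counter


def weighted_knn_predict(train_X, train_y, query, weights, k=3):
    """Majority vote among the k nearest training rows.

    Distance is a feature-weighted Euclidean distance, so a feature with a
    larger weight has more influence on which neighbours are selected.
    The weights are assumed to be given (hypothetical stand-ins for the
    probability coefficients mentioned in the abstract).
    """
    def distance(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    # Sort training rows by distance to the query (stable for ties).
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes.
train_X = [(0.0, 0.0), (0.1, 0.0), (5.0, 1.0), (5.1, 1.0)]
train_y = ["neg", "neg", "pos", "pos"]
query = (0.0, 1.0)

# Trusting only the first feature or only the second changes the prediction.
by_first = weighted_knn_predict(train_X, train_y, query, weights=(1.0, 0.0))
by_second = weighted_knn_predict(train_X, train_y, query, weights=(0.0, 1.0))
```

With the same query, shifting all weight from one feature to the other changes which neighbours are nearest and therefore the predicted class. This is the effect the abstract attributes to the probability coefficients.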

【Abstract】 A Bayesian network (BN) is a graphical representation of a probability distribution. Because of its well-defined semantics and solid theoretical foundations, it has become an important theoretical model in the artificial intelligence community and a powerful formalism for encoding uncertain knowledge. BNs have been applied with great success in fields such as machine learning, medical diagnosis and financial market analysis. It is usually difficult to construct a Bayesian network from domain experts alone, so fast and efficient learning from data matters for both research and application. Building on domestic and international algorithms, this dissertation studies algorithms for Bayesian network structure learning in depth and applies them to real needs such as predicting the risk factors of mild cognitive impairment and cerebrovascular disease. The main work is as follows:

1. The K-Nearest Neighbor (KNN) algorithm is a classification algorithm widely used in the fields of machine learning and data mining. This dissertation combines Bayesian network structure learning with KNN in an algorithm named BS-KNN, which improves the similarity measure of KNN: the higher the probability coefficient, the more important the corresponding feature and the greater its effect on the classification. Experimental results indicate that the new algorithm is comparable to related algorithms in complexity, while its accuracy and stability improve when the data set has many features and a large sample size.

2. Incomplete data occur frequently and reduce the accuracy of structure learning algorithms. A new Bayesian network structure learning algorithm is presented that combines the geometric distribution with Kullback-Leibler (KL) divergence and learns a Bayesian network directly from incomplete data. First, the geometric distribution is used to represent the correspondence relationships between nodes. Second, KL divergence is used to express the similarity between these relationships. Finally, estimates of the incomplete data are obtained. The algorithm avoids the exponential complexity of standard Gibbs sampling. Comparison with other related algorithms indicates that the new algorithm has higher accuracy in most situations.

3. Mild Cognitive Impairment (MCI) is thought to be the prodromal phase of Alzheimer's disease (AD), the most common form of dementia, which leads to irreversible neurodegenerative damage of the brain. Research on related methods is therefore very important for the prevention and treatment of AD. MCI is not easy to diagnose and requires a professional doctor to make a comprehensive diagnosis based on clinical experience. The MNBN algorithm is presented; it constructs a Bayesian network from memory, attention and demographic data, which greatly reduces the cost of examination and increases the objectivity of diagnosis. Clinical experimental results show that the MNBN algorithm is effective.

4. Cerebrovascular diseases (CeVD) represent a major cause of morbidity and mortality worldwide. It is reported in the literature that CeVD is one of the three major causes of death from human disease. It is therefore of great significance to strengthen the study of risk factors for CeVD. First, natural demographic information and several physiological indices are taken as the risk factors of CeVD, and the mutual relationships among them are analyzed. Second, information gain is used to determine the prior ordering of nodes, the Bayesian network is constructed, and the probabilistic dependency relationships among the risk factors are investigated further. Finally, experiments are carried out on a benchmark dataset. In comparison with related algorithms, the experimental results show that the model can assist in identifying the risk factors of CeVD objectively and effectively.
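Item 4 mentions using information gain to fix a prior ordering of nodes before the network is constructed. The sketch below shows one plausible reading: rank each discrete attribute by its information gain with respect to the outcome variable and use the descending order as the node ordering. It is an assumption-laden illustration, not the dissertation's procedure. The function names and the four-row toy table are hypothetical, and the abstract does not say which variable the gain is computed against or which search algorithm consumes the ordering.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, target):
    """H(target) - H(target | feature) for two equal-length discrete columns."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def order_nodes_by_gain(columns, target):
    """Return attribute names sorted by decreasing information gain."""
    gains = {name: information_gain(col, target) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy table (made up for illustration, not data from the dissertation).
outcome = [1, 1, 0, 0]
columns = {
    "sex": [0, 1, 0, 1],           # independent of the outcome in this table
    "diabetes": [1, 1, 1, 0],      # partly informative
    "hypertension": [1, 1, 0, 0],  # matches the outcome exactly
}
node_order = order_nodes_by_gain(columns, outcome)
```

An ordering of this kind is what order-based structure learners such as K2 take as input: each node may only choose parents from the nodes that precede it, which shrinks the search space considerably.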
