主题模型及其在中医临床诊疗中的应用研究

Study on Topic Model and Its Application to TCM Clinical Diagnosis and Treatment

【Author】 Zhang Xiaoping (张小平)

【Advisor】 Huang Houkuan (黄厚宽)

【Author information】 Beijing Jiaotong University, Computer Application Technology, 2011, Ph.D.

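【Abstract (translated from the Chinese)】 A topic model extracts the topics latent in a collection of documents (or another discrete data set), where each topic is a multinomial distribution over semantically related words. The main purpose of a topic model is to uncover the statistical regularities hidden in the data and to express them intuitively as topics; the topics can then be used for information retrieval, classification, clustering, summarization, and similarity or relevance judgments between pieces of information. In recent years topic models have become a new research direction in text mining, information retrieval, and related fields.

Traditional Chinese Medicine (TCM), an important part of the traditional life sciences, has distinctive characteristics and notable clinical efficacy in diagnosing and treating disease. Thousands of years of TCM practice have accumulated a large body of clinical data containing rich knowledge and regularities that are consistent with TCM theory. Against the background of TCM informatization, using modern techniques to mine the diagnosis and treatment regularities hidden in these clinical data is of great significance. As data mining techniques have matured and spread, mining TCM diagnosis and treatment regularities has become a focus of TCM theory research in China. In recent years researchers have applied cluster analysis, association rules, regression analysis, and discriminant analysis to TCM theory and have made some progress, but these methods still struggle to reflect the semantic complexity of TCM and the systematic nature of TCM diagnosis and treatment.

This dissertation makes a first attempt to introduce topic models into the study of TCM clinical diagnosis and treatment regularities. The motivation is twofold: we believe that topic models can capture the semantic features of TCM clinical data sets, and that topic inference and generation in a topic model are broadly consistent with the TCM process of syndrome differentiation and treatment, which the Treatise on Exogenous Febrile Diseases (《伤寒论》) describes as "observe the pulse and symptoms, know what has gone wrong, and treat according to the syndrome". Both proceed from observed variables to latent variables and back to observed variables. We apply topic models to clinical data on type 2 diabetes and coronary heart disease and to TCM literature data. The experiments show that topic models can extract clinically meaningful TCM diagnosis and treatment regularities, offering a novel theoretical method for TCM clinical research and an objective basis for TCM syndrome differentiation and treatment. The main work of this dissertation is as follows:

(1) Topic models, represented by Latent Dirichlet Allocation (LDA), have recently become a research focus in text mining and information retrieval. This dissertation systematically summarizes the background and development of topic models, the inference methods commonly used for LDA, and a number of typical topic models. This lays the foundation for the present research and gives researchers applying topic models a fairly systematic reference.

(2) A feature weighting mechanism for LDA is proposed. When we applied LDA directly to TCM clinical symptom topics, we found that the topic distributions were skewed toward high-frequency words: the words that characterize a topic were swamped by a small number of high-frequency words, which made the topics hard to interpret and to tell apart and also distorted the allocation of other words to topics during modeling. We therefore weight features by inverse document frequency (IDF) for standard text data and propose a novel Gaussian-function feature weighting method for TCM clinical data. The experiments show that weighted LDA improves the separation between topics, the interpretability of topics, and the modeling speed; that on the standard Newsgroups data set, using the learned topics as features for support vector machine (SVM) classification improves classification accuracy; and that under certain conditions the perplexity of the model is reduced.

(3) To address the fact that LDA cannot determine the number of topics automatically, a topic model that combines word similarity with the Chinese Restaurant Process (CRP) is proposed. In addition, because the two Dirichlet hyperparameters are hard to set sensibly in approximate inference for LDA by Gibbs sampling, a novel method of setting them is proposed. The experiments show that the proposed model adaptively updates topic contents and determines a reasonable number of topics, and that the hyperparameter setting adapts flexibly to different data sets and yields low perplexity.

(4) After analyzing the connection between topic models and TCM syndrome differentiation and treatment, a symptom-herb-diagnosis topic model is proposed on the basis of LDA and the author-topic model. It automatically extracts the topic structure among symptoms, herbs, and diagnoses in TCM clinical data and systematically explores clinically meaningful relationships among multiple entities. In experiments on type 2 diabetes clinical data, the model recovered diagnosis and treatment topic structures for typical complications and comorbidities of type 2 diabetes (such as diabetes with nephropathy and diabetic peripheral neuropathy). Analysis of the results shows that a class of symptoms, or a combination of them, only provides one way of, or one basis for, partitioning a population or disease; it does not mean that the combination corresponds to a unique syndrome or diagnosis, which reflects the individualized character of TCM diagnosis and treatment. At the same time, TCM also has common diagnosis and treatment regularities. The proposed symptom-herb-diagnosis topic model reveals the distribution of symptoms and herbs for a disease, and the regularities of TCM diagnosis and treatment, reasonably well.

(5) A complex disease such as diabetes usually has several complications. Its symptoms therefore show a hierarchical relationship between the main symptoms of the disease and the accompanying symptoms, and the herbs used show a corresponding hierarchy, in that a prescription is modified according to the symptoms. To reveal the hierarchical relationship between symptoms and the corresponding herbs, this dissertation proposes a hierarchical symptom-herb topic model based on the hierarchical LDA model and the Link LDA model. In experiments on diabetes clinical data, the model found a clinically meaningful hierarchical structure of symptoms and the corresponding hierarchical pattern of herb use. It provides a novel statistical method for exploring how prescriptions are modified according to symptoms in TCM clinical practice.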
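
Contribution (2) above weights feature words by inverse document frequency (IDF) for standard text data. The following is a minimal sketch of that weighting step alone, assuming the common form idf(w) = log(N / df(w)). The abstract states neither the exact formula nor how the weights enter the LDA sampler, so the function names and the toy symptom tokens are hypothetical, and the Gaussian weighting proposed for clinical data is not reproduced here.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return idf(w) = log(N / df(w)) for every word in a tokenised corpus.

    Assumption: plain unsmoothed IDF; the dissertation may use a variant.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document, however often it occurs there.
        doc_freq.update(set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def weighted_counts(doc, idf):
    """Scale the raw term counts of one document by the IDF weights."""
    counts = Counter(doc)
    return {word: count * idf[word] for word, count in counts.items()}


# Hypothetical toy corpus: each "document" is one record's symptom list.
docs = [
    ["thirst", "fatigue"],
    ["thirst", "numbness"],
    ["thirst", "edema", "fatigue"],
]
idf = idf_weights(docs)
print(weighted_counts(docs[0], idf))
```

In this toy corpus "thirst" occurs in all three records, so its weight is log(3/3) = 0, while the rarer "numbness" receives the largest weight. That is the effect the abstract describes: a few high-frequency words no longer dominate the counts a topic model sees.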

【Abstract】 Topic models extract the topics hidden in documents (or other discrete corpora), where each topic is a multinomial distribution over semantically related words. The main purpose of a topic model is to uncover the statistical regularities hidden in a discrete corpus and to express them directly as topics; the topics can then be used for information retrieval, classification, clustering, summarization, similarity and relevance estimation, and so on. Topic models have recently become a new research direction in text mining, information retrieval, and related fields.

Traditional Chinese Medicine (TCM), an important component of the traditional life sciences, has significant clinical efficacy in the diagnosis and treatment of diseases. A large amount of clinical data, containing much knowledge and many rules consistent with TCM theory, has been accumulated over thousands of years of TCM practice. Given the trend toward TCM informatics, it is important to use modern techniques to mine the rules of TCM diagnosis and treatment hidden in these clinical data. Many methods, such as cluster analysis, association rules, regression analysis, and discriminant analysis, have been used to study TCM theory, and some progress has been made, but it remains difficult to reflect two characteristics of TCM: its semantic complexity and the systematic nature of its diagnosis and treatment.

In this dissertation, we introduce topic models to the study of the rules of TCM clinical diagnosis and treatment for the first time.
Our motivation is twofold. We believe that topic models can capture the semantic characteristics hidden in TCM clinical data, and that the inference and generation of topics in a topic model follow much the same route as the process of "syndrome differentiation and treatment", which the Treatise on Exogenous Febrile Diseases describes as "inspect the pulse and symptoms, infer the disease, and then treat it". Both routes run from observable variables to latent variables and back to observable variables. We apply topic models to clinical data on type 2 diabetes mellitus (T2DM), clinical data on coronary heart disease, and the TCM literature. The experiments indicate that topic models can extract clinically meaningful rules of diagnosis and treatment. They provide a theoretical method for TCM clinical study and an objective basis for TCM clinical diagnosis and treatment. The main contributions of this dissertation are as follows:

(1) Topic models, represented by Latent Dirichlet Allocation (LDA), have recently become a research focus in text mining and information retrieval. This dissertation systematically summarizes the background and development of topic models, the general inference methods for LDA, and some typical topic models. This material forms the basis of the research in this thesis and a reference for other researchers.

(2) We propose a feature weighting mechanism for the LDA model. When learning TCM clinical symptom topics with the original LDA model, we found that the word distributions of the topics lean toward high-frequency words. The feature words that characterize a topic are swamped by a few high-frequency words, which makes the topics harder to interpret and to distinguish and distorts the allocation of other words to topics.
Therefore, we weight feature words by inverse document frequency (IDF) for standard text data, and for TCM clinical data we propose a novel feature weighting method based on a Gaussian function. The experiments indicate that the weighted LDA model improves the interpretability and discriminability of topics, speeds up modeling, improves Support Vector Machine (SVM) classification accuracy on the Newsgroups dataset, and reduces perplexity under appropriate conditions.

(3) To address the problem that the number of topics cannot be determined automatically in the LDA model, we propose a latent topic model that combines word similarity with the Chinese Restaurant Process (CRP). In addition, because the two Dirichlet hyperparameters are hard to set sensibly during Gibbs sampling for topic models, we put forward a novel method of setting them. The experiments indicate that the proposed model adaptively updates topic contents and determines a reasonable number of topics, and that the hyperparameter setting method adapts easily to different datasets and yields low perplexity.

(4) After analyzing the relationship between topic models and TCM "syndrome differentiation and treatment", we propose the Symptom-Herb-Diagnosis Topic (SHDT) model, based on the LDA model and the Author-Topic model, to automatically extract the topic structure among symptoms, herbs, and diagnoses and to explore clinically meaningful relationships among multiple entities. On clinical data for T2DM, the SHDT model captures meaningful diagnosis and treatment topics (clusters), which correspond clinically to important patient groups with comorbidities (e.g. diabetic kidney disease and diabetic peripheral neuropathy).
The experiments demonstrate that a class of symptoms, or a combination of symptoms, only provides one way of, or one basis for, classifying a population or disease; it does not imply that the combination corresponds to a unique syndrome or diagnosis, which reflects the individualized character of TCM therapy. At the same time, common TCM diagnosis and treatment rules do exist. The results show that the method helps reveal the distribution of the symptoms of diseases and the rules of TCM diagnosis and treatment.

(5) A complex disease such as T2DM usually has many kinds of comorbidities. Its symptoms therefore show a hierarchical relationship between the main symptoms and the concomitant symptoms of the disease. The herbs used to treat it show a corresponding hierarchical structure, which reflects the modification of a prescription according to symptoms. To reveal the hierarchical latent topic structure of both the symptoms and the corresponding herbs in TCM clinical data, we propose a Hierarchical Symptom-Herb Topic (HSHT) model, which combines the Hierarchical Latent Dirichlet Allocation (HLDA) model and the Link Latent Dirichlet Allocation (LinkLDA) model. Applying the HSHT model to clinical T2DM data, we obtain a meaningful hierarchical topic structure of symptoms and the corresponding herbs. The model offers a novel statistical method for studying the TCM clinical rules by which prescriptions are modified according to symptoms.
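
Contribution (3) relies on the Chinese Restaurant Process (CRP) to let the number of topics grow with the data. As a minimal sketch of the standard CRP prior only, under the usual seating rule (join table k with probability n_k / (n + alpha), open a new table with probability alpha / (n + alpha)): this is not the dissertation's model, which also uses word similarity, and the function name, `alpha` value, and seed are hypothetical choices for illustration.

```python
import random


def sample_crp_tables(num_customers, alpha, seed=0):
    """Seat customers one at a time under a Chinese Restaurant Process.

    With n customers already seated, the next one joins existing table k
    with probability n_k / (n + alpha) and opens a new table with
    probability alpha / (n + alpha). Requires alpha > 0.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table index of customer i
    for _ in range(num_customers):
        # Unnormalised weights: one per existing table, then alpha for a new one.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


assignments, table_sizes = sample_crp_tables(200, alpha=1.0, seed=1)
print(len(table_sizes), "tables for", len(assignments), "customers")
```

Read with tables as topics, the number of occupied tables is not fixed in advance; a larger `alpha` tends to open more tables, which is why a CRP prior is a natural fit when the number of topics is unknown.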
