
Research on Data Mining Techniques for Traditional Chinese Medicine Medical Records

【Author】 Zhang Yubin (张煜斌)

【Advisor】 Lu Jianfeng (陆建峰)

【Author information】 Nanjing University of Science and Technology, Computer Application Technology, 2009, Master's thesis

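【Abstract (translated from the Chinese)】 The medical records of renowned veteran TCM (Traditional Chinese Medicine) practitioners are a crystallization of their expertise, and data mining can help uncover the large body of clinical experience and prescribing patterns hidden in them. These records exist as free text, however, so text mining must first be used to extract information from the free text and build structured records before data mining can be applied effectively. This thesis first studies two text mining techniques, text classification and information extraction, and applies them to structuring the medical records of renowned veteran TCM practitioners. Data mining methods are then applied to the structured records to extract clinical experience. The research covers the following:
1. Chinese text classification based on character features. Information gain (IG) is used for feature selection, cosine similarity measures the similarity between documents, and a KNN classifier performs the classification. In experiments on the Fudan University news corpus, classification accuracy reached 86.92% and macro-averaged classification performance approached 87%. The results indicate that character features are an effective way to model features for Chinese text classification.
2. Chinese text information extraction. For the medical records of renowned veteran TCM practitioners, the Meta-Bootstrapping algorithm is used to extract terms, and the pattern structure needed for term extraction is designed. The method requires no shallow natural language processing and no corpus annotation: given only a small number of seed words, it completes the term extraction task after a number of iterations. In term extraction experiments on 206 medical records of one renowned practitioner, the F1 scores for prescription names, syndrome differentiation information, and treatment principles were 64.29%, 56.21%, and 76.64%, respectively. The records were then structured on the basis of the extracted terms.
3. Based on records processed by text classification and information extraction, the data preprocessing module of a system for mining the clinical experience of renowned veteran TCM practitioners is studied in depth, providing clean, structured source data for the subsequent data mining work.
4. Based on the preprocessed symptom information, the syndrome differentiation process for chronic gastritis is modeled. An existing latent structure model is improved with a method based on factor analysis, which improves the model's accuracy and training speed.
5. Based on the preprocessed prescription information, the dose-effect relationship of drugs is studied. A hierarchical clustering algorithm based on weighted Euclidean distance is designed and implemented. Using one renowned practitioner's asthma records as an example, drug usage patterns are mined and given a reasonable interpretation.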
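The classification pipeline in item 1 of the abstract (character features, cosine similarity, KNN) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy training documents, and the labels are all hypothetical, and the information gain feature selection step is omitted for brevity.

```python
import math
from collections import Counter


def char_vector(text):
    """Bag-of-characters vector: every non-whitespace character is a feature."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a if ch in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training_set, k=3):
    """Label `text` by majority vote among its k most similar training documents."""
    query = char_vector(text)
    scored = sorted(
        ((cosine_similarity(query, char_vector(doc)), label)
         for doc, label in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; the thesis used the Fudan University news corpus.
TRAIN = [
    ("股市上涨 银行利率下调", "economy"),
    ("银行发布利率报告", "economy"),
    ("球队赢得比赛冠军", "sports"),
    ("比赛中球员进球", "sports"),
]

print(knn_classify("银行利率上涨", TRAIN, k=3))
```

Working at the character level avoids the need for a Chinese word segmenter, which is the practical appeal of the approach the abstract reports.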

【Abstract】 The medical records of renowned TCM (Traditional Chinese Medicine) practitioners are a crystallization of their experience, and Data Mining (DM) can help us recover their clinical experience and prescribing patterns. However, the records usually take the form of unstructured free text, so Text Mining must first be used to extract information from them and structure the records, which is the foundation for mining. This thesis first studies Text Mining, focusing on Text Classification and Information Extraction. These techniques are then applied to structure the medical records of renowned TCM practitioners, and data mining methods are applied to the structured records to extract clinical experience. The concrete research work is as follows:
1. Chinese text classification based on character features. Information Gain is applied to select features, cosine similarity to measure the similarity between documents, and KNN as the classifier. In experiments on the news corpus from Fudan University, the classifier achieved 86.92% accuracy and a macro-averaged score close to 87%. The experimental results indicate that character-based features are an effective modeling method for Chinese text classification.
2. Information extraction of terms from clinical medical records. The Meta-Bootstrapping algorithm is adopted to extract terms from the medical records, and a pattern structure is designed for this purpose. The algorithm starts from a few manually provided seed words and completes term extraction after several iterations, with no need for shallow Chinese NLP techniques or a labeled training corpus. In experiments on 206 clinical medical records, prescription names, syndrome differentiation information, and treatment principles were extracted with F1 scores of 64.29%, 56.21%, and 76.64%, respectively. On the basis of term extraction, the unstructured medical records were converted into structured records.
3. Based on medical records processed by text classification and information extraction, data preprocessing for a TCM data mining system is studied, which provides clean, structured data for the subsequent mining work.
4. Based on the structured symptom information in the medical records, the syndrome differentiation process for chronic gastritis is modeled. An existing latent structure model is improved using factor analysis, which improves both the accuracy of the model and its training speed.
5. Based on the structured prescriptions, the dose-effect relationships of Chinese medicines are mined. An agglomerative clustering algorithm based on weighted Euclidean distance is designed and implemented. An experiment on the asthma clinical records of a renowned TCM practitioner reveals the essentials of his experience, and the results are well supported by the theory of Traditional Chinese Medicine.
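The clustering approach in item 5 of the abstract can be illustrated with a small sketch of agglomerative clustering under a weighted Euclidean distance. The abstract does not state the linkage criterion, the weighting scheme, or the data layout, so average linkage, the per-herb weights, and the dose vectors below are all assumptions made for illustration only.

```python
import math


def weighted_euclidean(x, y, weights):
    """sqrt(sum_i w_i * (x_i - y_i)^2) for equal-length vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def agglomerative(points, weights, n_clusters):
    """Merge the two closest clusters (average linkage, an assumption here)
    until only n_clusters remain. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(
            weighted_euclidean(points[i], points[j], weights)
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical data: each row is one prescription, each column the dose (g)
# of one herb. The weights express how much each herb's dose should count.
doses = [(9, 3), (10, 3), (3, 9), (4, 10)]
weights = (1.0, 0.5)

print(agglomerative(doses, weights, n_clusters=2))
```

Weighting the dimensions lets differences in the doses of some drugs count for more than others when prescriptions are grouped; how the thesis chose its weights is not stated in the abstract.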
