中医药知识发现可靠性研究
Research on Knowledge Discovery Reliability in Traditional Chinese Medicine
【Author】 Feng Yi (封毅);
【Advisor】 Wu Zhaohui (吴朝晖);
【Author information】 Zhejiang University, Computer Science and Technology, 2008, PhD
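【Abstract (translated from the Chinese)】 The reliability of knowledge discovery is an important but easily overlooked topic in the knowledge discovery field. As knowledge discovery and data mining techniques have come into wide use, one question has drawn growing attention: under what conditions is knowledge discovery reliable, or, put differently, under what conditions is the discovered knowledge reliable? Recent research on knowledge discovery reliability has mostly addressed reliability under one specific data mining model. The reliability themes shared across models, such as data quality and evaluation methods, have so far not been studied systematically. A staged, systematic summary of these shared themes has become a pressing need in knowledge discovery reliability research.

Among the fields where knowledge discovery is applied, one has a particular need for reliability research: Traditional Chinese Medicine (TCM). TCM, an important cultural asset and scholarly achievement of the Chinese nation, has in recent years faced challenges to its survival and development. How to turn these challenges into an opportunity, using knowledge discovery to advance TCM by leaps, has become an important task for TCM researchers. Recent TCM informatization work has created favorable conditions for knowledge discovery. However, TCM data is strongly natural-language in character, rich in meaning, and diverse in expression, and it still faces considerable data quality problems. Knowledge discovery on data with these features calls for closer attention to reliability than in other fields.

Against this background, this thesis centers on knowledge discovery reliability in TCM. It examines reliability factors at every stage of the knowledge discovery life cycle and proposes the knowledge discovery reliability framework PBRF-KD. Among the more prominent reliability problems in TCM knowledge discovery, it focuses on three: structural factors, representational factors, and trustworthiness factors. The work and contributions of this thesis include the following:

1) A process-based knowledge discovery reliability framework. Because existing reliability research is tied to specific models, the thesis proposes a model- and application-independent framework, PBRF-KD. The framework takes a process-based approach to sorting out the stages of the knowledge discovery workflow and their reliability factors, and identifies 7 reliability-related factors. It gives knowledge discovery projects a complete reliability blueprint.
2) Optimization methods for structure-related reliability factors. The thesis analyzes the structure-related reliability factors in TCM knowledge discovery, chiefly data completeness. For completeness of textual fields, it proposes a method for imputing missing TCM text fields based on an order-semisensitive similarity measure. For missing category labels in TCM literature, it proposes a multi-label text classification method based on M-Similarity.
3) Optimization methods for representation-related reliability factors. The thesis analyzes the representation-related reliability factors in TCM knowledge discovery, namely representation granularity and representation consistency. For granularity, it proposes a rule-based granularity subdivision method. For consistency, it proposes an ontology-based method for making representations consistent. Together these methods help improve representation-related reliability in TCM.
4) Optimization methods for trust-related reliability factors. The thesis analyzes the trust-related reliability factors in TCM knowledge discovery, chiefly data trustworthiness. For the trustworthiness problems specific to TCM data, it proposes one measure based on acceptance in the historical literature and another based on popularity on the Internet. Building on these two measures, it proposes a weighted frequent pattern mining algorithm based on data trustworthiness, which yields meaningful results on the Xiaoke formula (消渴方) and spleen-stomach formula (脾胃方) datasets. Together these methods help improve trust-related reliability in TCM.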
【Abstract】 Reliability is a key issue in knowledge discovery, yet this important topic has not been well explored. The wide application of knowledge discovery technology poses a significant question for the community: under which conditions is the discovery reliable, or, alternatively, under which conditions is the discovered knowledge reliable? Most existing work on this topic considers knowledge discovery reliability (KDR) in the context of specific data mining models. However, many reliability issues are common to different models, such as data quality and evaluation methods. A systematic study of these issues is therefore needed.

Among the application areas of knowledge discovery, one field particularly needs to consider KDR: Traditional Chinese Medicine (TCM). As a complete medical knowledge system that has played an indispensable role in health care for the Chinese people for several thousand years, TCM has been confronted with great development pressure in recent years. As a methodology capable of extracting useful patterns from data, knowledge discovery is expected to help promote the development of TCM. However, TCM data is known to have strong natural-language characteristics and varied expression patterns, and its data quality is still unsatisfactory. Knowledge discovery on data with such features requires more careful consideration of KDR.

This thesis focuses on KDR in the TCM field. It provides a systematic discussion of reliability issues across the whole life cycle of knowledge discovery, together with a process-based KDR framework named PBRF-KD. It then concentrates on three important types of KDR factors in TCM practice: structural factors, representational factors, and trustworthiness-related factors.
The major work and contributions of this thesis are as follows:

First, we propose a process-based KDR framework named PBRF-KD. As the first framework to study KDR from a process perspective, PBRF-KD provides a unified view and an effective approach for analyzing and estimating KDR. As a model-independent framework, PBRF-KD can be applied by data analysts in various domains to assess KDR. The six steps and seven main factors in PBRF-KD provide a traceable way to analyze the reliability of knowledge discovery, and can be viewed as an applicable blueprint for analyzing KDR across the whole knowledge discovery process.

Second, we present the key structural factors with regard to KDR in TCM and propose a series of methods to optimize them. Data completeness is analyzed as the major structural factor in TCM. For missing values in textual attributes of TCM data, we propose an imputation method based on an order-semisensitive similarity named M-Similarity. For missing labels in medical literature, we propose a multi-label text categorization approach based on M-Similarity.

Third, we present the key representational factors with regard to KDR in TCM and propose a series of methods to optimize them. The major representational factors in TCM are representation granularity and representation consistency. For representation granularity, we propose a rule-based method of granularity subdivision. For representation consistency, we propose an ontology-based method to resolve representation inconsistency.

Lastly, we present the key trustworthiness-related factors with regard to KDR in TCM and propose a series of methods to optimize trustworthiness. For the data trustworthiness issue in the TCM field, we propose a trustworthiness evaluation method based on historical acceptance in the literature, as well as a trustworthiness evaluation method based on popularity on the Web. Using these two methods to generate weights for frequent pattern mining, we propose a weighted frequent pattern mining method based on data trustworthiness, and obtain meaningful results on two TCM formula datasets.
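
The idea of weighting frequent patterns by data trustworthiness can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm, so everything below is an assumption made for illustration: each record is taken to carry one trust score in (0, 1], weighted support is defined as the share of total trust weight held by the records containing an itemset, and the function name `weighted_frequent_itemsets`, the herb names, and the scores are all hypothetical. The enumeration is brute force rather than an optimized miner.

```python
from itertools import combinations


def weighted_frequent_itemsets(records, min_support, max_size=2):
    """Return {itemset: weighted support} for itemsets meeting min_support.

    records: list of (items, weight) pairs, where weight is a hypothetical
    trustworthiness score for the whole record (an assumption, not the
    thesis's definition).
    """
    total = sum(weight for _, weight in records)
    if total <= 0:
        raise ValueError("total weight must be positive")

    frequent = {}
    for size in range(1, max_size + 1):
        weight_sum = {}
        for items, weight in records:
            # Sort so that each itemset has one canonical tuple form.
            for combo in combinations(sorted(set(items)), size):
                weight_sum[combo] = weight_sum.get(combo, 0.0) + weight
        for combo, value in weight_sum.items():
            support = value / total
            if support >= min_support:
                frequent[combo] = support
    return frequent


# Made-up formula records: (herbs in the formula, trust score of its source).
records = [
    ({"huangqi", "gancao"}, 1.0),
    ({"huangqi", "renshen"}, 0.5),
    ({"gancao", "renshen"}, 0.25),
    ({"huangqi", "gancao", "renshen"}, 0.25),
]

patterns = weighted_frequent_itemsets(records, min_support=0.6)
```

In this toy data every herb pair appears in two of the four records, so plain frequency cannot separate them. With the assumed trust scores, only the pair found in the most trusted record clears the 0.6 threshold, which is the kind of effect a trustworthiness weighting is meant to produce.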
【Key words】 Knowledge Discovery; Data Mining; Reliability; Knowledge Discovery Reliability; Knowledge Discovery in Traditional Chinese Medicine; Data Quality;