数据挖掘方法研究及其在中药复方配伍分析中的应用

Data-Mining Methods Study and Its Application in Traditional Chinese Prescription Compatibility Analysis

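【Author】 Li Li (李力)

【Supervisor】 Jin Fan (靳蕃)

【Author information】 Southwest Jiaotong University (西南交通大学), Traffic Information Engineering and Control, 2003, PhD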

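【Abstract (translated from the Chinese)】 Traditional Chinese Medicine (TCM) has a long history and has contributed greatly to the prosperity of the Chinese nation. Traditional Chinese prescriptions (TCP) are an important part of TCM; the historical literature alone records more than one hundred thousand of them. Using modern information technology, especially data mining, to explore the compatibility (配伍) of herbs in TCP is an important route to the modernization of TCM. Data mining is a collection of scientific methods proposed for solving large practical problems in fields such as machine learning, pattern recognition, and database technology; its main aim is to discover efficiently the knowledge or regularities hidden in large databases and to support the decisions of human experts.

Centered on a national project, this thesis studies data mining methods for TCP and uses them for a preliminary analysis of TCP compatibility. The main work is as follows.

Frequent itemset mining is an important area of data mining. Some frequent itemset mining methods are based on Apriori: they follow a candidate generation-and-test strategy and must scan the database repeatedly, which is time-consuming. FP-growth is an important frequent itemset mining method that generates no candidate sets. Building on FP-growth, this thesis proposes an improved algorithm, FP-growth*, that is faster and easier to implement. The new algorithm uses a modified FP-tree and header table structure, builds the FP-tree only once, and builds only a header table at each recursion. It produces the same frequent itemsets as the original algorithm, but simulation experiments show that FP-growth* is at least twice as fast as FP-growth.

A graph-based association rule mining algorithm, GRG (Graph based method for association Rules Generation), is proposed. Frequent closed itemsets are a subset of the frequent itemsets but carry the same information. GRG builds an association graph that represents the frequent relationships between frequent items and recursively generates frequent closed itemsets from that graph. It then builds a lattice graph of the frequent itemsets and generates association rules from the lattice relations. GRG scans the database only twice, generates no candidate sets, and performs well in both speed and scalability.

A parallel frequent itemset mining algorithm based on FP-growth*, PFP-growth (Parallel FP-growth), is proposed. PFP-growth distributes the mining task evenly across the parallel processors, applies partitioning strategies during mining to balance the load between processors, and uses suitable data structures to reduce the volume of data communicated between them. Simulation experiments on a national high-performance computer show that it is an effective parallel algorithm.

SQL-based methods for the basic rough set computations are proposed, including computing equivalence classes and positive regions. Importance evaluation is an important method for screening drugs. The concepts of relative and absolute importance are proposed for rough set importance evaluation, and a condition for absolute importance is given and proved. The differences between importance evaluation based on rough sets and importance evaluation based on frequency statistics are discussed. The rough set based method is used to analyze the categories of Chinese herbal drugs used for chronic hepatitis B.

The concept of rough set data reduction is introduced, including relative and absolute reduction, and both are unified as set operations on a difference list, which is derived from the difference (discernibility) matrix. On this basis, a heuristic data reduction algorithm based on the ant colony system is proposed.

Finally, the thesis describes the work on TCP itself: the history and methodological characteristics of TCP, the preprocessing of TCP data, the construction of a TCP database, and the design of a TCP analysis system.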
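
The abstract states that frequent closed itemsets are a subset of the frequent itemsets yet carry the same information. The minimal Python sketch below illustrates that property only. It uses brute-force enumeration, not GRG or FP-growth*, and the transactions, herb codes, and support threshold are hypothetical values chosen for illustration.

```python
from itertools import combinations

# Toy "prescriptions": each transaction is a set of hypothetical herb codes.
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"C", "D"},
]
MIN_SUPPORT = 2  # absolute support threshold (assumed for illustration)


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration; fine for toy data, exponential in general."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def support_from_closed(itemset, closed):
    """Recover the support of a frequent itemset from the closed ones alone."""
    return max(s for y, s in closed.items() if itemset <= y)


frequent = frequent_itemsets(TRANSACTIONS, MIN_SUPPORT)
closed = closed_itemsets(frequent)
```

On this toy data, `closed` keeps three itemsets ({A, B}, {C}, {D}) out of five frequent ones, and `support_from_closed` recovers the support of {A} from its closed superset {A, B}.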

【Abstract】 Traditional Chinese Medicine (TCM) has a long history and has contributed greatly to the prosperity of the Chinese nation. Traditional Chinese Prescriptions (TCP) are an important part of TCM; more than one hundred thousand prescriptions are recorded in the historical literature. Mining the compatibility of TCP data with modern information technology, especially data mining, is an effective way to speed up the modernization of TCM and TCP. Data mining is a collection of scientific methods proposed for solving large practical problems in machine learning, pattern recognition, database technology, and related fields. Its purpose is to discover implicit knowledge and to help human experts make decisions.

This thesis studies TCP data mining methods in connection with a national project and uses these methods to analyze the compatibility of TCP.

Frequent itemset mining is an important area of data mining. Some studies adopt an Apriori-like candidate set generation-and-test approach, but candidate set generation is very time-consuming. FP-growth is an important frequent itemset mining algorithm that generates frequent itemsets without candidate sets. Based on an analysis of FP-growth, this thesis proposes a new algorithm, FP-growth*, which is much faster and also easy to implement. By adopting a modified FP-tree and header table data structure, FP-growth* builds the FP-tree only once and builds a header table in each recursive step. The new algorithm yields the same frequent itemsets, but the performance study shows that FP-growth* is at least twice as fast as FP-growth.

The algorithm GRG (Graph based method for association Rules Generation) is proposed for association rule mining on the basis of frequent closed itemsets. Frequent closed itemsets are a subset of the frequent itemsets, but they contain all the information of the frequent itemsets. The new algorithm constructs an association graph to represent the frequent relationships between items and recursively generates frequent closed itemsets from that graph. It also constructs a lattice graph of frequent closed itemsets and generates association rules based on the lattice graph. It scans the database only twice and avoids candidate set generation. GRG shows good performance in both speed and scale-up properties.

A new algorithm, PFP-growth (Parallel FP-growth), which is based on FP-growth*, is proposed for parallel frequent itemset mining. PFP-growth distributes the task fairly among the parallel processors. Partitioning strategies are devised at different stages of the mining process to achieve balance between processors, and new data structures are adopted to reduce the data transferred between processors. Experiments on a national high-performance parallel computer show that PFP-growth is an efficient parallel algorithm for mining frequent itemsets.

SQL-based rough set computation methods, including the computation of equivalence classes and positive regions, are provided. The concepts of relative and absolute importance in rough set importance evaluation are proposed, and a condition for absolute importance is given and proved. The differences between importance evaluation based on rough sets and importance evaluation based on frequency statistics are also discussed. The important medicines for chronic viral hepatitis type B are analyzed with the rough set importance evaluation method.

The problems of data reduction, including relative reduction and absolute reduction, are introduced and unified as set operations on a difference list, which is derived from the difference (discernibility) matrix. A heuristic reduction algorithm based on the ant colony system is proposed.

Finally, the study of TCP compatibility is introduced, including the history and characteristics of TCP, the preprocessing of TCP data, the construction of a TCP database, and the design of a TCP analysis system.
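
For readers unfamiliar with the rough set terms used in the abstract, the following minimal Python sketch shows the textbook definitions of equivalence classes, the positive region, and the dependency degree. It is not the thesis's SQL implementation. The decision table, attribute names, and values are hypothetical, and `significance` uses the common "drop in dependency degree" measure as a stand-in, since the abstract does not define its relative and absolute importance measures.

```python
from collections import defaultdict

# Hypothetical decision table: each row is one prescription record.
# Condition attributes say whether a herb category is present (1) or not (0);
# "effective" is the decision attribute. All names and values are made up.
ROWS = [
    {"heat_clearing": 1, "blood_activating": 0, "tonifying": 1, "effective": 1},
    {"heat_clearing": 1, "blood_activating": 0, "tonifying": 0, "effective": 1},
    {"heat_clearing": 0, "blood_activating": 1, "tonifying": 1, "effective": 0},
    {"heat_clearing": 0, "blood_activating": 1, "tonifying": 0, "effective": 1},
    {"heat_clearing": 0, "blood_activating": 0, "tonifying": 1, "effective": 0},
    {"heat_clearing": 0, "blood_activating": 0, "tonifying": 1, "effective": 1},
]
CONDITIONS = ["heat_clearing", "blood_activating", "tonifying"]
DECISION = ["effective"]


def equivalence_classes(rows, attributes):
    """Partition row indices by their values on `attributes` (like SQL GROUP BY)."""
    classes = defaultdict(set)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in attributes)].add(index)
    return list(classes.values())


def positive_region(rows, conditions, decision):
    """Union of condition classes that fall entirely inside one decision class."""
    decision_classes = equivalence_classes(rows, decision)
    region = set()
    for block in equivalence_classes(rows, conditions):
        if any(block <= d for d in decision_classes):
            region |= block
    return region


def dependency(rows, conditions, decision):
    """Dependency degree: fraction of rows that lie in the positive region."""
    return len(positive_region(rows, conditions, decision)) / len(rows)


def significance(rows, conditions, decision, attribute):
    """Drop in dependency degree when `attribute` is removed (assumed measure)."""
    rest = [a for a in conditions if a != attribute]
    return dependency(rows, conditions, decision) - dependency(rows, rest, decision)
```

On this toy table the positive region is rows 0 to 3, because rows 4 and 5 agree on every condition attribute but differ in the decision. Calling `significance(ROWS, CONDITIONS, DECISION, "tonifying")` then measures how much of that region is lost when the hypothetical `tonifying` category is ignored.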
