

The Research of Key-techniques in Knowledge Discovery System for TCM Pharmacology

【Author】 Hu Jianjun (胡建军)

【Advisor】 Tang Changjie (唐常杰)

【Author information】 Sichuan University, Computer Application Technology, 2006, PhD

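【Abstract (translated from the Chinese)】 Chinese medicine has been used to treat disease in China for thousands of years and is of great importance to human health. Because the theory of Chinese medicine is complex and incomplete, traditional research methods have run into many difficulties, which seriously hinders the inheritance and development of traditional Chinese medicine (TCM). Data mining is an emerging computing technology that draws on databases, data warehouses, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, information retrieval, genetic algorithms and other disciplines, and it can extract previously unknown but potentially useful information and knowledge from large volumes of data. With support from the National Natural Science Foundation of China (Nos. 60473071, 90409007) and the State Administration of Traditional Chinese Medicine (No. 2003JP40), we apply data mining to the study of TCM prescriptions, attempting to mine pharmacological information such as the property, flavor, channel tropism and efficacy of prescriptions from a large body of proven ancient and modern prescriptions, in order to provide supporting information for clinical prescribing and research in TCM and to contribute to the development of Chinese medicine. In the course of this work, the thesis proposes several data mining algorithms suited to the characteristics of the TCM domain; these algorithms can also be used in other data mining settings. The main results are:
1. The nearest neighbor search theorem is proved and, based on it, the SNN (Searching Nearest Neighbors) search algorithm is proposed. Point-by-point nearest neighbor search must compare all data pairwise, giving time complexity O(n^2). SNN finds the nearest neighbors with far fewer comparisons, giving time complexity O(n*log(n)), which drops to O(n) when the data come from a scanned image.
2. Based on the idea that objects of the same class lie close together, the NNAF (Nearest Neighbors Absorbed First) algorithm is proposed for clustering arbitrarily shaped clusters in high-dimensional space, with time complexity O(n). The MLCA (Multi-Layer Cluster Algorithm) algorithm is also proposed, and two related theorems are proved. In most clustering algorithms, changing the threshold and reclustering means rerunning the original clustering from the start, whereas MLCA clusters incrementally on top of the existing clustering and can save more than 90% of the time.
3. The Elitism Producing Strategy (EPS) is proposed for generating elite individuals in the initial population of Gene Expression Programming (GEP), so that the initial population contains individuals with relatively high fitness and evolution starts from a higher point. Experiments show that EPS improves evolutionary efficiency by as much as 17%.
4. To produce a better initial population in GEP, the Gene Space Balance Strategy (GSBS) is proposed, which distributes genes uniformly over the gene space. Initial populations produced by GSBS have much better gene diversity than randomly generated ones, which can greatly improve evolutionary efficiency. Experiments show that GSBS improves evolutionary efficiency by more than 20%.
5. A formula is proposed for quantitatively measuring the gene diversity of a population in GEP. To address the tendency of traditional GEP to converge to local optima, the VPS-GEP (Various Population Strategy GEP) algorithm is proposed, which lets the population escape local optima quickly. Experiments show that VPS-GEP reduces the number of stagnant generations by more than 55%.
6. Drawing on the design and implementation of a prototype knowledge discovery system for TCM pharmacology, the thesis briefly describes how the proposed algorithms are applied in the system, and also covers the system architecture, database design and preprocessing scheme.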
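For orientation, the point-by-point baseline that contribution 1 compares against can be sketched as below. This is only the O(n^2) pairwise search, not the SNN algorithm, whose details the abstract does not give. The function name `brute_force_nearest_neighbors` and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def brute_force_nearest_neighbors(points):
    """Return, for each point, the index of its nearest other point.

    Hypothetical baseline: every pair is compared exactly once, so the
    number of distance evaluations is n*(n-1)/2, i.e. O(n^2).
    Euclidean distance is an assumption; the abstract names no metric.
    """
    n = len(points)
    best_dist = [math.inf] * n
    best_idx = [-1] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            # Update both endpoints, since the distance is symmetric.
            if d < best_dist[i]:
                best_dist[i], best_idx[i] = d, j
            if d < best_dist[j]:
                best_dist[j], best_idx[j] = d, i
    return best_idx
```

Any method claiming O(n*log(n)) or O(n) has to avoid this double loop, typically by exploiting spatial structure in the data so that most pairs are never compared.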

【Abstract】 Traditional Chinese Medicine (TCM) has been used to treat diseases in China for thousands of years and is important to people's health. Because TCM pharmacology is complex and incomplete, research on TCM with traditional methods faces many difficulties, which hinder the inheritance and development of TCM.

Data mining (DM) is a new computing technology. It combines techniques from databases, data warehouses, artificial intelligence, machine learning, artificial neural networks, statistics, pattern recognition, information retrieval, genetic algorithms and other fields, and it has been used successfully in practice to extract useful information from large amounts of data.

Supported by grants from the National Natural Science Foundation of China (Nos. 60473071, 90409007) and the State Administration of Traditional Chinese Medicine (No. 2003JP40), we studied techniques for mining the properties, flavors, channel tropism, efficacy and other pharmacological information of TCM prescriptions from a large number of prescriptions. The results can be used in TCM research.

The main contributions are:
1. The Searching Nearest Neighbor Theorem is proposed and proved. Based on the theorem, the SNN (Searching Nearest Neighbors) algorithm is proposed, with time complexity O(n*log(n)), or O(n) if the data are obtained by scanning an image. Point-by-point nearest neighbor search must compare all the data pairwise, with time complexity O(n^2), whereas SNN compares only a small part of the data.
2. Based on the idea that an object and its nearest neighbors are most probably in the same cluster, the NNAF (Nearest Neighbors Absorbed First) clustering algorithm is proposed for multi-dimensional data with clusters of arbitrary shape; its time complexity is O(n). In most other clustering algorithms, when the threshold is adjusted the clustering procedure has to be performed again from beginning to end, taking nearly as long as the first run. In contrast, when the threshold is changed after NNAF has been run, performing MLCA (Multi-Layer Cluster Algorithm) on the existing clustering saves more than 90% of the time.
3. The Elitism Producing Strategy (EPS) for the initial population of Gene Expression Programming (GEP) is proposed. It places chromosomes with higher fitness in the initial population, so that evolution starts from a higher level. Experiments show that EPS increases evolutionary efficiency by 17%.
4. To produce a better initial population for GEP, the Gene Space Balance Strategy (GSBS) is proposed. The genes in an initial population produced by GSBS are more diverse than those produced at random. Experiments show that GSBS increases evolutionary efficiency by more than 20%.
5. A criterion is proposed to describe quantitatively the gene diversity of a GEP population. To address the problem of local optima in standard GEP, the Various Population Strategy (VPS-GEP) is proposed to let evolution escape from local optima quickly. Experiments show that VPS-GEP reduces the number of stagnant generations by more than 55%.
6. The design and implementation of the knowledge discovery system for TCM pharmacology are described. The algorithms proposed above are used in the system. In addition, the system architecture, database design and preprocessing schemes are described.
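The saving described in contribution 2 comes from reusing an existing clustering when the threshold changes instead of starting over. The sketch below illustrates that idea only; it is not NNAF or MLCA, whose details the abstract does not give. It assumes a simple rule (two points share a cluster when a chain of pairwise distances no greater than the threshold connects them), Euclidean distance, and a threshold that only increases. The names `DisjointSet`, `sorted_pairs` and `extend_clustering` are hypothetical.

```python
import math


class DisjointSet:
    """Union-find structure holding the current cluster of each point."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def sorted_pairs(points):
    """All point pairs as (distance, i, j), sorted by distance.

    Building this list is O(n^2); it is a simplification for the sketch,
    not a reproduction of the O(n) behaviour claimed in the abstract.
    """
    n = len(points)
    pairs = [(math.dist(points[i], points[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort()
    return pairs


def extend_clustering(clusters, pairs, start, threshold):
    """Merge clusters using pairs[start:] with distance <= threshold.

    Returns the index of the first unprocessed pair, so that a later call
    with a larger threshold resumes there instead of reclustering.
    """
    k = start
    while k < len(pairs) and pairs[k][0] <= threshold:
        _, i, j = pairs[k]
        clusters.union(i, j)
        k += 1
    return k


def cluster_labels(clusters, n):
    """One representative index per point; equal labels mean same cluster."""
    return [clusters.find(i) for i in range(n)]
```

A caller would run `extend_clustering(clusters, pairs, 0, t1)` once and, after raising the threshold to `t2`, call it again with the returned index, so only the pairs between `t1` and `t2` are examined. Lowering the threshold is not supported by this sketch, because merges cannot be undone.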

  • 【Online publication submitter】 Sichuan University
  • 【Online publication issue】 2008, No. 04
  • 【Classification codes】 TP182; TP311.13
  • 【Times cited】 4
  • 【Downloads】 502
