大肠早癌辅助诊断数据挖掘方法研究

Research on Data Mining Techniques for Computer-aided Colorectal Carcinoma Diagnosis Systems

【Author】 Liao Zhifang (廖志芳)

【Supervisor】 Fan Xiaoping (樊晓平)

【Author Information】 Central South University (中南大学), Control Science and Engineering, 2008, PhD

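【Abstract, translated from the Chinese】 With the development of medical diagnostic technology, medical departments have accumulated large amounts of diagnostic information, such as patients' medical images, physiological and biochemical indicators, bioinformatics indicators, and background data. Hidden in these data is much important information that could serve as a basis for clinical aided diagnosis, so it is necessary to analyze and process it with appropriate techniques. Data mining is one of the techniques widely applied to the analysis of medical diagnostic data. By processing large volumes of historical data in patient databases, it can extract valuable diagnostic rules, so that judgments can be made from a patient's age, sex, lifestyle, auxiliary examination results, biochemical indicators, and so on. This excludes interference from human factors, is highly objective, and yields diagnostic rules that generalize well.

Building on data mining techniques and using laser-induced autofluorescence diagnostic data for early colorectal cancer as the vehicle, this dissertation analyzes the characteristics of the diagnostic data and studies them in depth from three aspects: data preprocessing, formation of the training data set, and classification and prediction methods. The result is a laser-induced aided diagnosis system for early colorectal cancer that offers clinicians a means of aided diagnosis.

The dissertation first analyzes the mechanism, characteristics, and research significance of diagnosing early colorectal cancer by laser-induced autofluorescence. Based on the characteristics of medical diagnostic data, it proposes a workflow for analyzing and processing the aided-diagnosis data and analyzes each part, with emphasis on the composition of the spectral data acquisition system and the method of acquiring spectral data. It also applies a sequence of denoising steps: filtering out high-frequency electronic noise, removing the spectral baseline, extracting the signal in the effective band, and normalizing the fluorescence spectra.

For incomplete early colorectal cancer fluorescence data, after analyzing and comparing feature extraction methods, the dissertation proposes an information-entropy rough set principal component analysis algorithm based on tolerance relations. Compared with classical rough sets, tolerance-relation rough sets can accommodate the incompleteness of diagnostic data. An information entropy that decreases monotonically as the amount of information decreases is introduced, and on this basis an attribute reduction method is proposed and applied to the spectral data; principal component analysis is then used for further feature extraction. The algorithm extracts the features that influence the diagnosis of early colorectal cancer, reduces data dimensionality, and lowers the complexity of subsequent processing.

Because medical diagnostic data are mostly mixed data, and after analyzing existing mixed-data clustering algorithms, the dissertation proposes a lattice-theory-based clustering algorithm for mixed data. Lattices are used to distribute the data so as to eliminate the distributional differences between numeric and symbolic attributes, and clustering is computed from the number of lattices that cover the data, so the algorithm no longer requires data conversion when processing mixed data. The algorithm's parameters, namely the initial number of clusters and the choice of center points, are analyzed and optimized: the initial number of clusters is optimized with a genetic algorithm to obtain its range of values, the optimization of center-point selection is described, and the algorithm's performance is analyzed. Based on the resulting clustered data sets, the mean-variance method and the fluorescence intensity ratio discrimination method are used to extract data features, giving classification features of normal and cancerous tissue that provide a basis for classification.

To address the real-time requirement of medical diagnosis, an analysis of the performance of the adopted classification algorithm shows that it involves a large amount of repeated computation, so its computational complexity and space complexity are relatively high. To solve this, the dissertation proposes a processing method based on an index tree structure: an index tree is built in which nodes with many repeated computations are placed in the upper levels and nodes without repetition in the lower levels. This reduces repeated computation and effectively lowers both computational and space complexity, meeting the real-time requirement of diagnosis.

To address the imbalance in medical diagnostic data, after analyzing the distribution characteristics of imbalanced data and current methods for handling it, the dissertation uses sampling techniques to propose three methods: global-density imbalanced data classification, μ-density imbalanced data classification, and imbalanced data classification based on the local density of boundary samples. The global-density method takes a combined average based on the samples of each class; it favors the classification of sparse data while reducing the effectiveness of classifying dense data. The μ-density method uses a cost-sensitive approach, analyzing the cost of classification correctness to obtain a suitable value of μ for selecting sample data, thereby improving classification effectiveness on imbalanced data. The boundary-sample local-density method focuses on the boundary samples in the imbalanced data set, classifies the boundary data by several methods, and analyzes the relevant parameters. All three algorithms use sample selection to increase the number of minority-class samples and thereby reduce the imbalance.

The dissertation concludes by summarizing its innovations and proposing directions for future research.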
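The four denoising steps named in the abstract (noise filtering, baseline removal, band extraction, normalization) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the moving-average filter, the constant-offset baseline estimate, peak normalization, and the names `moving_average` and `preprocess_spectrum` are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Smooth high-frequency noise with a centered moving average."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def preprocess_spectrum(
    wavelengths: Sequence[float],
    intensities: Sequence[float],
    band: Tuple[float, float],
    window: int = 5,
) -> Tuple[List[float], List[float]]:
    """Denoise, remove a constant baseline, crop to `band`, peak-normalize."""
    if not intensities or len(wavelengths) != len(intensities):
        raise ValueError("wavelengths and intensities must be non-empty and equal in length")
    # Step 1: suppress high-frequency noise (assumed: simple moving average).
    smoothed = moving_average(intensities, window)
    # Step 2: remove the baseline (assumed: a constant offset equal to the minimum).
    baseline = min(smoothed)
    corrected = [v - baseline for v in smoothed]
    # Step 3: keep only samples inside the effective band (inclusive bounds).
    kept = [(w, v) for w, v in zip(wavelengths, corrected) if band[0] <= w <= band[1]]
    if not kept:
        raise ValueError("no samples fall inside the requested band")
    # Step 4: normalize so the strongest remaining peak equals 1.
    peak = max(v for _, v in kept)
    if peak == 0:
        raise ValueError("flat spectrum cannot be normalized")
    return [w for w, _ in kept], [v / peak for _, v in kept]
```

Calling `preprocess_spectrum(wavelengths, intensities, band=(420, 480))` returns the cropped wavelengths and intensities scaled to the range 0 to 1; the band limits here are placeholders, not values taken from the dissertation.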

【Abstract】 With the development of bioinformatics and biomedical engineering, a large amount of medical information, including medical imaging resources, physiological indicators, bioinformatics data, and patient background data, has become available in many hospitals and research groups. This information needs to be analyzed, because useful information that could serve as aided-diagnosis rules is often concealed by general processing methods.

Data mining technology is advancing quickly in biomedical areas. It can process massive stores of historical medical data to derive useful diagnostic rules from patient information, including age, gender, habits, and examination results. Because the rules come from large-scale data processing without human interference, they are generally applicable.

This dissertation studies the processing of the auto-fluorescence spectrogram for colorectal carcinoma with data mining techniques, in three steps: preprocessing, forming the training samples, and building the classification model. The research results are used to build aided-diagnosis methods based on the auto-fluorescence spectrogram for colorectal carcinoma, with the aim of giving doctors a tool for diagnosis.

The dissertation first analyzes the theory and characteristics of the auto-fluorescence spectrogram for colorectal carcinoma and presents the modules of the Auto-Fluorescence Spectrogram for Colorectal Carcinoma Aided Diagnosis System, with details of each part. Methods for removing noise from the spectrogram are also provided.

To handle incomplete data, the dissertation presents an algorithm called RPCA, which performs attribute reduction using rough sets based on a tolerance relation, combined with PCA. A novel definition of entropy is introduced, under which knowledge decreases as the granularity of information becomes smaller. A new reduction algorithm for tolerance rough sets is then presented, which extracts data features together with PCA. With this algorithm, data features can be extracted, the number of attributes reduced, and the complexity of later processing lowered.

As most biomedical data are hybrid data, the dissertation presents a lattice-based clustering algorithm for hybrid data. The algorithm uses a lattice to eliminate the difference between ordinal and nominal samples without the data conversion that would otherwise affect accuracy. The parameters of the algorithm are optimized as well: a genetic algorithm is used to optimize the initial number of clusters, and the selection of the mean points is also optimized. From the clustered samples, several methods are used to obtain the rules that distinguish normal from pathological tissue.

To meet the time constraint, a novel index algorithm for classification is designed and applied. The algorithm uses an index tree to reduce repeated calculation and achieves higher efficiency in both computation and storage, especially in applications with large amounts of repeated data.

To deal with data imbalance, the dissertation presents three algorithms: Overall-density unbalance classification, μ-density unbalance classification, and Margin-density unbalance classification. All are based on sampling: they increase the number of sparse samples and obtain higher performance, especially on unbalanced data. Some parameters of these algorithms are analyzed: a cost-sensitive method is presented to optimize μ according to the cost of correct and erroneous classification, and two other parameters of the Margin-density unbalance classification algorithm are analyzed as well.

Finally, the innovations of the thesis are summarized and future research subjects are presented.
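The general idea of reducing imbalance by increasing the number of sparse-class samples can be sketched with plain random oversampling. This is a generic baseline shown only for illustration, not any of the three density-based algorithms the dissertation proposes; the function name `random_oversample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[Hashable]]:
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    if not by_class:
        return [], []
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Original samples are always kept; extras are drawn with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

With 8 samples labeled `"normal"` and 2 labeled `"cancer"`, the function returns 16 samples, 8 per class. A density-based method such as those described in the abstract would choose which samples to add according to local or global density instead of uniformly at random.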

  • 【Online Publication Contributor】 Central South University (中南大学)
  • 【Online Publication Issue】 2010, No. 02