计算机辅助医学影像诊断中的关键学习技术研究

Study on the Key Learning Technology in Computer-aided Diagnosis for Medical Image

【Author】 Shen Ye (沈晔)

【Advisor】 Xia Shunren (夏顺仁)

【Author information】 Zhejiang University, Biomedical Engineering, 2014, PhD

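【Abstract, translated from the Chinese】 Computer-aided diagnosis (CAD), in which computer technology assists radiologists in diagnosing cases, plays an increasingly important role in early breast cancer screening and can effectively help reduce mortality among breast cancer patients. In clinical practice, labeled case samples are hard to collect and negative cases far outnumber positive ones, so CAD applications face the problem of learning from small-sample, imbalanced datasets. Imbalanced and small-sample learning concerns learning performance on datasets whose classes are severely skewed and whose information is under-represented. Such learning matters in many real applications. Although classical machine learning and data mining techniques have been very successful in practice, learning from small-sample and imbalanced data remains a major challenge for researchers. This dissertation systematically explains the main reasons why machine learning performance degrades in small-sample and imbalanced settings, and reviews the effective methods currently used to address these problems. Starting from the observation that common under-sampling methods tend to lose class information when handling imbalanced samples, the dissertation concentrates on how to handle imbalanced data reasonably and effectively. It proposes two new under-sampling methods that extract the samples richest in class information, thereby addressing the loss of class information caused by under-sampling. For the small-sample learning problem, it proposes a new class labeling algorithm that enlarges the training set by automatically labeling unlabeled samples while effectively reducing the labeling errors that tend to occur in the process. The dissertation thus focuses on learning techniques for small-sample and imbalanced data, centered on resampling of imbalanced datasets and class labeling of unlabeled samples. The main contributions are as follows.

(1) To address the small-sample learning problem caused by the difficulty of collecting labeled cases in CAD applications, the dissertation uses the abundant unlabeled samples to enlarge the training set. However, labeling often produces wrong class labels, and mislabeled samples act like noise and significantly degrade learning performance. To address mislabeling in semi-supervised learning, the dissertation proposes the Hybrid Class Labeling algorithm, which labels samples from three different perspectives: geometric distance, probability distribution, and semantic concept. The three labeling methods rest on different principles and differ markedly from one another. Unlabeled samples on which all three methods agree are added to the training set. To further reduce the adverse effect of any remaining mislabeled samples, the algorithm introduces pseudo-label membership degrees into SVM (Support Vector Machine) learning, so that the membership degree controls how much each sample contributes to learning. Experiments on the Breast-cancer dataset from UCI show that the algorithm effectively addresses the small-sample learning problem. Compared with any single class labeling technique, it produces fewer mislabeled samples and achieves learning performance significantly better than the other algorithms.

(2) To address the loss of useful class information that common under-sampling techniques often incur, the dissertation proposes a new under-sampling method based on the convex hull (CH). The convex hull of a dataset is the smallest convex set containing all samples in the set, and every sample lies within the polygon or polyhedron formed by the hull vertices. Inspired by this geometric property, the algorithm samples the majority class to obtain its convex hull and replaces the majority-class training samples with the compact set of hull vertices in order to balance the dataset. Because the two classes usually overlap in practice, their convex hulls overlap as well. In that case, using the convex hull to represent the boundary structure of the majority class poses a challenge for learning and easily leads to overfitting and poorer generalization. Considering that the reduced convex hull (RCH) and the scaled convex hull (SCH) lose boundary information as the hull shrinks, we propose the Hierarchy Reduced Convex Hull (HRCH). Inspired by the marked structural differences and complementarity between RCH and SCH, we fuse the two to generate the HRCH structure. Compared with other reduced convex hull structures, HRCH contains more diverse and complementary class information and effectively reduces the loss of class information during hull reduction. The algorithm samples the majority class with different values of the reduction factor and the scaling factor, and each resulting HRCH structure is combined with the minority-class samples to form a training set. Multiple learners are trained on these sets, and the final classifier is produced by ensemble learning. In an experimental comparison with four reference algorithms, the proposed algorithm shows better classification performance and robustness.

(3) To address the loss of class information in under-sampling algorithms, the dissertation further proposes a new under-sampling method based on reverse k nearest neighbors, RKNN. Compared with the widely used k nearest neighbors, reverse k nearest neighbors examine the neighborhood from a global perspective. The reverse k nearest neighbors of a point depend not only on the points around it but also on the remaining points in the dataset. A change in the data distribution changes the reverse nearest neighbor relationships of every sample, so these relationships reflect the complete distribution structure of the sample set. Because reverse nearest neighbors propagate neighborhood relationships between samples, they overcome the drawback that nearest neighbor queries attend only to the local distribution around the query point. For the majority class, the algorithm uses reverse k nearest neighbors to remove noisy samples, unstable boundary samples, and redundant samples, keeping the most informative and reliable samples as training samples. The algorithm balances the training set while effectively mitigating the loss of class information caused by under-sampling. Experiments on the Breast-cancer dataset from UCI confirm that the algorithm is effective for imbalanced learning. Compared with under-sampling based on k nearest neighbors, the RKNN algorithm achieves better performance.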
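The agreement step of the hybrid class labeling idea in contribution (1) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the three labelers, so a nearest-centroid rule, a 1-nearest-neighbor rule, and a Gaussian naive Bayes rule are used here as hypothetical stand-ins, and all function names are invented for this sketch. The membership-weighted SVM step is not shown.

```python
import numpy as np

def nearest_centroid_labels(X_lab, y_lab, X_unl):
    """Stand-in labeler 1: assign the class of the closest class centroid."""
    classes = np.unique(y_lab)
    centroids = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])
    d = ((X_unl[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def one_nn_labels(X_lab, y_lab, X_unl):
    """Stand-in labeler 2: assign the class of the single nearest labeled sample."""
    d = ((X_unl[:, None, :] - X_lab[None, :, :]) ** 2).sum(axis=2)
    return y_lab[d.argmin(axis=1)]

def gaussian_nb_labels(X_lab, y_lab, X_unl, eps=1e-6):
    """Stand-in labeler 3: Gaussian naive Bayes with per-class feature variances."""
    classes = np.unique(y_lab)
    scores = []
    for c in classes:
        Xc = X_lab[y_lab == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + eps
        log_prior = np.log(len(Xc) / len(X_lab))
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_unl - mu) ** 2 / var).sum(axis=1)
        scores.append(log_prior + log_lik)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

def unanimous_pseudo_labels(X_lab, y_lab, X_unl):
    """Return indices and labels of unlabeled samples on which all three labelers agree."""
    labelers = (nearest_centroid_labels, one_nn_labels, gaussian_nb_labels)
    votes = np.stack([f(X_lab, y_lab, X_unl) for f in labelers])
    agree = (votes == votes[0]).all(axis=0)
    return np.flatnonzero(agree), votes[0][agree]

# Toy data: class 0 has one labeled sample lying close to class 1.
X_lab = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [5, 5], [5, 6], [6, 5]], dtype=float)
y_lab = np.array([0, 0, 0, 0, 1, 1, 1])
X_unl = np.array([[0.2, 0.2], [5.5, 5.5], [4.2, 4.2]])

idx, labels = unanimous_pseudo_labels(X_lab, y_lab, X_unl)
print(idx, labels)
```

In this toy case the third unlabeled point is rejected because the 1-nearest-neighbor rule and the nearest-centroid rule disagree on it, which is the kind of uncertain sample the agreement filter is meant to keep out of the training set.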

【Abstract】 Computer-aided diagnosis (CAD), which uses computer technology to assist radiologists in diagnostic decision-making, can play a key role in the early detection of breast cancer and help reduce the death rate from female breast cancer. However, it is hard to collect enough cases labeled by radiologists in the clinic, and the number of positive cases is always much smaller than the number of negative cases. CAD therefore always involves imbalanced and small-sample learning. The imbalanced and small-sample learning problems concern the performance of learning algorithms in the presence of severe class distribution skew and under-represented data, respectively. Learning from imbalanced and under-represented data has great significance in the real world. Although machine learning and data mining techniques have shown great success in many applications, imbalanced and small-sample learning remain major challenges for researchers. This dissertation first explains the main causes of degraded learning performance when the training dataset is small and highly imbalanced, and then systematically reviews the popular and advanced solutions for this learning task. Recognizing that common under-sampling methods cause a loss of class information, we focus on how to handle the majority class reasonably in order to solve the imbalanced learning problem effectively. Two novel under-sampling methods are proposed to avoid the loss of class information by selecting the most representative samples. In addition, a novel class labeling algorithm is proposed for the small-sample learning problem. This algorithm expands the training dataset by labeling unlabeled samples automatically while effectively reducing labeling mistakes. The dissertation thus studies learning from imbalanced and under-represented data, focusing on novel resampling schemes for imbalanced learning and a novel class labeling scheme for expanding the training dataset. The contributions are summarized below.

(1) To deal with the learning problem caused by an under-represented labeled training set in CAD, the proposed scheme enlarges the labeled training set by adding pseudo-labeled samples drawn from the abundant unlabeled samples. However, common class labeling algorithms make mistakes, and falsely labeled samples act as noise and degrade learning performance. To avoid labeling mistakes, a novel hybrid class labeling (HCL) algorithm is proposed. HCL combines three class labeling schemes, based on geometric similarity, probabilistic distribution, and semantic concept respectively. Because the three schemes rest on different principles, they differ distinctly from one another. Only unlabeled samples that receive the same label from all three schemes are added to the training set. To further reduce the harm to learning performance from any remaining labeling mistakes, the memberships of the pseudo-labeled samples are introduced into the SVM, so that the contribution of each pseudo-labeled sample to the learning task is determined by its membership. Classification experiments on the Breast-cancer dataset from UCI show that the proposed algorithm deals effectively with small-sample learning problems and produces fewer labeling mistakes and better classification performance than algorithms that adopt a single labeling scheme.

(2) To deal with the loss of class information caused by common under-sampling methods, a novel under-sampling scheme based on the convex hull (CH) is proposed. The convex hull of a dataset is the smallest convex set that contains all data points in the dataset. All data points lie inside the convex polygon or polyhedron formed by its vertices. Inspired by the geometric characteristics of the convex hull, we compute the convex hull of the majority class and select its vertices to form a reduced majority set that balances the training set. Because the data points of the two classes usually overlap in real-world applications, their convex hulls overlap as well. In this situation, representing the training set by its convex hull is a challenge for the learning task and can lead to overfitting and degraded generalization. Considering that both the Reduced Convex Hull (RCH) and the Scaled Convex Hull (SCH) lose class information, a novel reduced convex hull structure, the Hierarchy Reduced Convex Hull (HRCH), is proposed. Inspired by the evident diversity and complementarity between RCH and SCH, we combine RCH and SCH to build HRCH. Compared with other reduced convex hulls, HRCH contains more diverse and complementary class information and effectively alleviates the loss of class information during the reduction process. By choosing different reduction factors and scaling factors, several diverse HRCHs are obtained from the majority class. Each HRCH is then combined with the minority class to form a training set. The learners trained on these training sets are integrated into the final classifier. Classification experiments show that the proposed algorithm has better and more robust classification performance than four other traditional algorithms.

(3) An improved under-sampling algorithm based on reverse k nearest neighbors (RKNN) is further proposed to overcome the loss of class information caused by common under-sampling. In contrast to k nearest neighbors (kNN), RKNN examines the neighborhood globally. The RKNN of a data point depends not only on its surrounding points but also on the other points in the dataset. A change in the data distribution can change the reverse nearest neighbors of every point in the dataset. A characteristic of reverse nearest neighbors is that neighborhood relationships propagate through the dataset, which overcomes the shortcoming that nearest neighbors reflect only the local distribution. The algorithm tries to find more representative and reliable samples in the majority class by using RKNN to remove noisy and redundant majority samples, thereby balancing the training set while avoiding the loss of majority class information. Classification experiments on the Breast-cancer dataset from UCI show that the proposed algorithm deals effectively with class-imbalanced problems and achieves better classification performance than the kNN-based scheme.
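The reverse k nearest neighbor notion behind contribution (3) can be illustrated with a short sketch. A point q is a reverse k nearest neighbor of p when p appears among the k nearest neighbors of q, so the count for p depends on every other point in the set. The filtering rule below (drop majority samples that no other sample lists as a neighbor) is an assumption made for illustration only; the abstract does not give the dissertation's actual criteria for noisy, boundary, or redundant samples, and the function names are hypothetical.

```python
import numpy as np

def reverse_knn_counts(X, k):
    """Count, for each point, how many other points include it among their k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    # Row i holds the indices of the k nearest neighbors of point i.
    knn = np.argsort(d, axis=1, kind="stable")[:, :k]
    return np.bincount(knn.ravel(), minlength=len(X))

def drop_unreferenced(X, k):
    """Assumed rule: keep only samples that have at least one reverse k nearest neighbor."""
    counts = reverse_knn_counts(X, k)
    return np.flatnonzero(counts > 0)

# Toy majority class: four clustered points and one isolated point.
X_major = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
counts = reverse_knn_counts(X_major, k=2)
kept = drop_unreferenced(X_major, k=2)
print(counts, kept)
```

The isolated point at (10, 10) has neighbors of its own, but no other point lists it among its two nearest neighbors, so its reverse neighbor count is 0 and the assumed rule removes it. A plain kNN query centered on that point would not reveal this.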

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication issue】 2014, No. 08