The Application of Multiple Classifier Systems in the Analysis of Gene Microarray Datasets
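【Author】 Liu Kunhong (刘昆宏)
【Supervisor】 Huang Deshuang (黄德双)
【Author Information】 University of Science and Technology of China, Pattern Recognition and Intelligent Systems, 2008, PhD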
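【Abstract (translated from the Chinese)】 Multiple classifier systems are a current research focus in machine learning. Because an ensemble built from several base classifiers usually generalizes better than a single strong classifier, multiple classifier systems offer new solutions to many classification problems that are hard to solve with traditional pattern recognition methods. DNA microarray technology is a high technology formed at the intersection of physics, microelectronics, molecular biology, and other fields. It is increasingly widely used in medicine and biology, and its application to cancer analysis and detection makes it possible to study pathological features such as the onset and spread of cancer in depth at the scale of large numbers of genes. In particular, for reliable diagnosis and prediction of cancer types, identifying key cancer genes and classifying cancers have become two important topics in current cancer research. Nevertheless, because microarray data are high-dimensional and have few samples, conventional pattern recognition methods do not always give satisfactory results. This thesis analyzes in depth the application of multiple classifier systems to gene microarray datasets and designs new ensemble systems to better solve the classification of microarray data. The main work is summarized as follows:
(1) From a machine learning perspective, the core of key cancer gene identification is a feature selection problem. Combining filter methods, this thesis designs ensemble feature selection methods based on a standard genetic algorithm and on a multi-objective genetic algorithm, respectively. In the experiments, filter methods first screen the genes, a genetic algorithm then performs further feature selection, and each selected feature subset is used to construct a base classifier, yielding an ensemble feature selection system. The experimental results show that the designed algorithms effectively select suitable gene subsets and that the resulting ensembles achieve good recognition performance.
(2) Independent component analysis is a linear transformation method proposed in recent years that has been applied successfully to microarray data analysis. Borrowing the idea of ensemble feature selection, this thesis designs an ensemble independent component selection system. The system first applies an independent component analysis algorithm to transform the microarray data linearly, then uses a genetic algorithm to select suitable subsets of independent components, each of which is used to build a base classifier. Because this approach ensures diversity among the base classifiers, combining them by voting is enough to form a robust ensemble.
(3) When applied to microarray data, independent component analysis algorithms do not always yield reproducible sets of independent components. This thesis exploits the differences between independent component sets to propose a new way of constructing an ensemble. The ensemble is based on a multi-objective genetic algorithm: the different independent component sets obtained from independent component analysis transformations are screened separately, and a good subset is obtained from each set to build a base classifier. The experimental results show that this method yields more diverse base classifiers, so the final ensemble performs better.
(4) Rotation forest is a recently proposed multiple classifier system. Its distinguishing feature is that it uses a linear transformation method to generate rotation matrices, so that the data can be projected into different coordinate systems and diverse classifiers can be built. Because the system requires that the feature dimension of the dataset not be too high, it cannot be applied directly to the classification of gene microarray data. This thesis uses filter methods to reduce the dimension of the microarray data and obtain datasets suitable for rotation forest. In addition, independent component analysis is introduced as a new way to produce the rotation matrix. Experiments on two common datasets show that rotation forest achieves good recognition results on gene microarray data and that rotation forest based on independent component analysis achieves the best recognition performance.
(5) Key gene selection and cancer class discrimination are often harder on multiclass cancer microarray datasets than on two-class ones, because in multiclass problems each class has few samples and the class sizes are often unbalanced. This thesis designs a genetic programming method based on sub-ensembles that performs feature selection and class discrimination at the same time. The algorithm first decomposes the multiclass problem into several two-class problems. Then, within the genetic programming design, small ensembles (called sub-ensembles) handle the individual two-class problems, and these sub-ensembles are fused to form one individual. Because each individual contains a set of sub-ensembles, it generalizes well and can handle multiclass discrimination directly. The thesis gives a feature-based diversity measure and uses a local optimization algorithm to ensure the diversity of the sub-ensembles, which further improves the efficiency of the system. The experimental results show that the designed algorithm effectively achieves both key gene selection and cancer class discrimination.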
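Item (1) describes a pipeline: filter the genes, choose feature subsets, train one base classifier per subset, and combine the classifiers. The following Python sketch illustrates that pipeline on toy data. It is not the thesis's implementation: the signal-to-noise filter score, the nearest-centroid base classifier, the toy data, and every function name are assumptions made for illustration, and random sampling of subsets stands in for the genetic algorithm search.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def filter_rank(X, y, top_k):
    """Filter step: rank features by a signal-to-noise score (labels assumed 0/1)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        denom = pstdev(a) + pstdev(b) + 1e-9  # small constant avoids division by zero
        scores.append((abs(mean(a) - mean(b)) / denom, j))
    return [j for _, j in sorted(scores, reverse=True)[:top_k]]

def train_centroid(X, y, feats):
    """Base classifier: nearest class centroid, restricted to the subset `feats`."""
    cents = {}
    for lab in set(y):
        rows = [row for row, l in zip(X, y) if l == lab]
        cents[lab] = [mean(r[j] for r in rows) for j in feats]

    def predict(x):
        return min(cents, key=lambda lab: sum(
            (x[j] - c) ** 2 for j, c in zip(feats, cents[lab])))
    return predict

def build_ensemble(X, y, top_k=4, n_members=5, subset_size=2, seed=0):
    """Filter first, then train one base classifier per feature subset.
    Random subsets are a stand-in for the GA-selected subsets."""
    rng = random.Random(seed)
    kept = filter_rank(X, y, top_k)
    return [train_centroid(X, y, rng.sample(kept, subset_size))
            for _ in range(n_members)]

def vote(ensemble, x):
    """Combine the base classifiers by majority vote."""
    return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]

# Toy data: features 0-3 separate the classes, features 4-5 are constant.
X = [
    [0.0, 0.1, 0.2, 0.0, 5.0, 1.0],
    [0.1, 0.0, 0.1, 0.2, 5.0, 1.0],
    [0.2, 0.2, 0.0, 0.1, 5.0, 1.0],
    [1.0, 0.9, 1.1, 1.0, 5.0, 1.0],
    [0.9, 1.0, 1.0, 1.2, 5.0, 1.0],
    [1.1, 1.1, 0.9, 1.1, 5.0, 1.0],
]
y = [0, 0, 0, 1, 1, 1]
ensemble = build_ensemble(X, y)
```

Calling `vote(ensemble, sample)` returns the label chosen by most base classifiers. In a real setting the random subsets would be replaced by a search that scores each candidate subset, for example by cross-validated accuracy.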
【Abstract】 Multiple classifier systems (MCSs) have drawn much attention in the field of machine learning. Because an MCS fuses a set of base classifiers, the resulting ensemble usually generalizes better than a single strong classifier. Ensembles are therefore a promising solution for many problems that are hard for traditional pattern classification methods.
DNA microarray technology is a recently developed technology that draws on physics, electronics, molecular biology, and other fields, and it has been widely applied in biology and medicine. Among its applications, microarray-based cancer diagnosis makes it possible to study the pathological mechanisms of cancer in depth, including its onset and spread. To achieve reliable diagnosis and prediction of cancer types, much research focuses on identifying the key genes of different cancers and on classifying cancers. However, because microarray data combine small sample sizes with high dimensionality, traditional methods cannot always achieve good performance.
This thesis focuses on the analysis and classification of microarray datasets with multiple classifier systems. The main work is summarized as follows:
(1) The selection of key genes in a microarray dataset is usually regarded as a feature selection problem. In this study, the merits of filter and wrapper methods are combined to design two ensemble feature selection systems, based on a standard genetic algorithm (GA) and a multi-objective GA, respectively. Filter methods first select a set of genes, and the GAs then select suitable subsets from which the base classifiers are constructed. The experimental results show that these methods are capable of selecting optimal feature subsets and that the ensembles built in this way are robust.
(2) Independent component analysis (ICA) is a recently proposed linear transformation method that has been applied successfully to the analysis of microarray datasets. Inspired by ensemble feature selection, an ensemble independent component selection method is proposed. A microarray dataset is first transformed by the ICA algorithm to obtain a set of independent components (ICs), and a standard GA then selects a group of IC subsets from that set to construct different base classifiers. Because this method can guarantee diversity among the base classifiers, the ensemble is robust even when the base classifiers are combined simply by the majority vote rule.
(3) When ICA algorithms are applied to microarray datasets, the results are found not always to be reproducible: different ICA transformations yield different IC sets. In this thesis, a multi-objective GA is therefore proposed to select optimal IC subsets from the different IC sets. These IC subsets are used to train the base classifiers from which the ensemble is built. With this method, the diversity among base classifiers is much higher than with the former method, so the ensemble has strong generalization ability.
(4) Rotation Forest is a recently proposed ensemble system. Its success lies in using a linear transformation method to build a rotation matrix, which is then used to project the data onto different axes so that diverse base classifiers are obtained. Because this ensemble incurs a high computational cost on high-dimensional datasets, it had not previously been applied to microarray datasets. In this thesis, filter methods are used to reduce the dimension of the datasets so that Rotation Forest can be used to analyze them, and ICA is employed to construct the rotation matrix for the first time. The experimental results show that Rotation Forest achieves better performance than other ensemble schemes and that ICA-based Rotation Forest achieves the highest classification accuracy.
(5) Classification of multiclass microarray datasets is much more difficult than classification of two-class datasets, because each class usually has fewer samples and the class sizes are unbalanced. To classify multiclass microarray datasets efficiently, a genetic programming (GP) method is proposed that splits the multiclass problem into multiple two-class problems. Its distinguishing characteristic is that each individual consists of a set of small-scale ensemble systems (called sub-ensembles here), each of which tackles one of the two-class problems. In this way, each individual can solve a multiclass problem directly, and the GP can perform feature selection and classification at the same time. A diversity measure is proposed based on the difference among the features in each tree, and a greedy local improvement algorithm is used to maintain the diversity among the sub-ensembles. These measures ensure the high efficiency of the GP.
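Two ideas from item (5) can be illustrated with a short Python sketch: splitting a multiclass problem into two-class problems, and keeping ensemble members diverse in the features they use. The abstract does not define the thesis's diversity measure, so the mean pairwise Jaccard distance used here is an assumption, as are the function names and the toy feature sets. The greedy step also ignores classification accuracy, which an actual fitness function would have to weigh as well.

```python
from itertools import combinations

def pairwise_problems(labels):
    """Split a multiclass problem into one two-class problem per class pair."""
    return list(combinations(sorted(set(labels)), 2))

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0 for identical feature sets, 1 for disjoint ones."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def diversity(feature_sets):
    """Assumed diversity measure: mean pairwise Jaccard distance."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def greedy_diversify(members, candidates):
    """Greedy local improvement: try each candidate in each slot and
    keep a swap only if it strictly increases the diversity."""
    members = [set(m) for m in members]
    for i in range(len(members)):
        for cand in candidates:
            trial = members[:i] + [set(cand)] + members[i + 1:]
            if diversity(trial) > diversity(members):
                members = trial
    return members

# Hypothetical feature sets used by three trees of one sub-ensemble.
members = [{1, 2}, {1, 2}, {1, 3}]
candidates = [{4, 5}, {6, 7}]
improved = greedy_diversify(members, candidates)
```

Here `pairwise_problems` yields one class pair per sub-ensemble, and `greedy_diversify` would be applied within each sub-ensemble to the feature sets of its trees.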
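- 【Online Publication Contributor】 University of Science and Technology of China 【Online Publication Year/Issue】 2009, Issue 06
- 【Classification Code】 TP181
- 【Citation Count】 15
- 【Download Count】 724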