基于多特征的集成分类器在基因表达数据分类中的应用

Application of Multi-feature Based Classifier Ensemble for Gene Expression Data Classification

【Author】 Zhao Yaou (赵亚欧)

【Advisor】 Chen Yuehui (陈月辉)

【Author information】 University of Jinan, Computer Application Technology, 2008, Master's thesis

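【Abstract (translated from the Chinese original)】 With the progress of the Human Genome Project, DNA microarray technology emerged as a revolutionary technique. It can measure the expression of tens of thousands of genes automatically, quickly, and efficiently. By analyzing the resulting gene expression data, one can understand the physiological state of cells at the molecular level, such as survival, proliferation, differentiation, apoptosis, carcinogenesis, and stress response. These questions matter for clinical diagnosis, assessment of drug efficacy, and explanation of disease mechanisms.

Gene expression data are enormous in volume and extremely complex, and they are hard to interpret directly with medical imaging methods. Classifying gene expression data has therefore become a very difficult problem in bioinformatics. Early on, pattern recognition methods, backed by the computing power of modern computers, were commonly used and achieved some results. In recent years, as machine learning algorithms have been applied ever more widely in bioinformatics, many researchers have proposed them as a new approach to gene expression data classification. Unfortunately, because gene expression data have few samples, many features, and nonlinear structure, applying machine learning methods directly still runs into difficulties, mainly for two reasons: 1. With too many features, the important features are masked by numerous irrelevant ones, which makes learning hard for the classifier. 2. With too few samples, most classifiers overfit. To address the large number of features, informative genes are usually extracted from the raw data to reduce dimensionality; to address the small number of samples, classifier ensembles are often used to strengthen the learning ability of a single classifier and thereby raise classification accuracy.

For a good gene expression data classification system, informative gene selection and classifier ensemble construction are two indispensable steps. In practice, however, the two steps are often carried out in isolation: the first step does not lay a good foundation for the second, and may even lower the classification accuracy of the whole system. After summarizing the strengths and weaknesses of commonly used earlier methods, this thesis combines informative gene selection with classifier ensemble construction and proposes a multi-feature based classifier ensemble method. The idea of the algorithm is as follows. First, different informative gene extraction algorithms, such as correlation analysis, the Golub method, and the t-test, are applied to the data to obtain several feature subsets of the samples. Then, by sampling with replacement, samples are drawn from the different feature subsets to form training subsets. Because the training subsets are drawn from different feature subsets, they are more diverse. Next, a group of neural networks learns these training subsets; to keep the networks from getting trapped in local optima, they are trained with particle swarm optimization (PSO). Finally, following the selective ensemble idea of "Many could be better than all", an estimation of distribution algorithm (EDA) selects the best neural network classifiers for the ensemble, which makes the final classification decision.

To verify the effectiveness of the method, classification experiments were run on the widely used gene expression datasets Leukemia, Colon, Ovarian, and Lung Cancer. The results show that the proposed method achieves higher classification accuracy and stability than other methods.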
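
The first two steps described above (several gene-ranking criteria, then one bootstrap sample per feature subset) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the scoring formulas, so the standard signal-to-noise score associated with Golub et al., the Welch t statistic, and the Pearson correlation with a 0/1 class label are assumed here. All function names and the toy data are hypothetical, and the PSO-trained neural networks are not shown.

```python
import math
import random
import statistics


def golub_score(pos, neg):
    """Signal-to-noise score: mean difference over summed standard deviations."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return (statistics.mean(pos) - statistics.mean(neg)) / spread if spread else 0.0


def t_score(pos, neg):
    """Welch two-sample t statistic."""
    var = statistics.variance(pos) / len(pos) + statistics.variance(neg) / len(neg)
    return (statistics.mean(pos) - statistics.mean(neg)) / math.sqrt(var) if var else 0.0


def correlation_score(pos, neg):
    """Pearson correlation between expression values and the 0/1 class label."""
    values = pos + neg
    labels = [1.0] * len(pos) + [0.0] * len(neg)
    mv, ml = statistics.mean(values), statistics.mean(labels)
    cov = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    norm = math.sqrt(sum((v - mv) ** 2 for v in values) * sum((l - ml) ** 2 for l in labels))
    return cov / norm if norm else 0.0


def top_k_genes(X, y, score_fn, k):
    """Rank genes (columns of X) by |score| and return the indices of the best k."""
    scores = []
    for g in range(len(X[0])):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(score_fn(pos, neg)))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


def bootstrap_subsets(X, y, feature_subsets, rng):
    """Draw one bootstrap sample (with replacement) per feature subset."""
    subsets = []
    for genes in feature_subsets:
        rows = [rng.randrange(len(X)) for _ in range(len(X))]
        subsets.append(([[X[r][g] for g in genes] for r in rows], [y[r] for r in rows], genes))
    return subsets


# Invented toy data: 12 samples x 6 genes; only genes 0 and 3 depend on the class.
rng = random.Random(0)
y = [1] * 6 + [0] * 6
X = [[rng.gauss(0, 1) + (3.0 * label if g in (0, 3) else 0.0) for g in range(6)] for label in y]

# One feature subset per ranking criterion, then one training subset per feature subset.
feature_subsets = [top_k_genes(X, y, fn, 2) for fn in (correlation_score, golub_score, t_score)]
training_subsets = bootstrap_subsets(X, y, feature_subsets, rng)
```

Each entry of `training_subsets` holds resampled rows restricted to one criterion's genes, so the base classifiers trained on them differ both in which samples and in which genes they see.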

【Abstract】 With the development of the Human Genome Project, DNA microarray technology has emerged as a revolutionary technique. It can measure the expression of tens of thousands of genes automatically, rapidly, and efficiently. By analyzing the resulting gene expression data, we can understand the physiological state of cells at the molecular level, such as survival, proliferation, differentiation, apoptosis, carcinogenesis, and stress response. These issues play an important role in medical diagnosis, the assessment of drug efficacy, and the explanation of disease mechanisms.

Gene expression data are very complex and enormous in volume, and they are very difficult to interpret directly with medical imaging methods. Thus, gene expression data classification has become one of the toughest problems in bioinformatics. Early on, pattern recognition methods were often employed and, with the help of powerful computers, achieved some results. In recent years, as machine learning algorithms have become widely used in bioinformatics, they have been proposed as a new approach to gene expression data classification. However, because gene expression data have few samples, a very large number of features, and nonlinear structure, these methods are difficult to apply directly. This is mainly because: 1. Important features are masked by the many irrelevant features, so they are hard for classifiers to learn. 2. Too few samples cause the classifier to overfit. To solve the first problem, feature selection methods are often applied to reduce the dimensionality. For the second problem, classifier ensembles are usually used to increase the classification accuracy.

For a good gene expression data classification system, gene feature selection and classifier ensemble construction are two essential steps. However, these two steps are often carried out in isolation in practical applications. The first step does not provide a good foundation for the second, and may even reduce the overall classification accuracy.

In this thesis, a novel classifier ensemble based on multiple features is proposed. The method combines gene feature selection with classifier ensemble construction. The algorithm works as follows. First, in order to extract useful features and reduce dimensionality, different feature selection methods, such as correlation analysis and the Fisher ratio, are used to form different feature subsets. Then a pool of candidate base classifiers is trained with the PSO (Particle Swarm Optimization) algorithm on training subsets resampled from the different feature subsets. Finally, following the selective ensemble idea of "many could be better than all", appropriate classifiers are selected with an EDA (Estimation of Distribution Algorithm) to form the classification committee.

Four common datasets, namely Leukemia, Colon, Ovarian, and Lung Cancer, were used to test this method. Experiments show that the proposed method achieves higher classification accuracy and stability than the other methods.
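
The final selection step can be illustrated with a small sketch. The abstract does not say which EDA variant or which combination rule was used, so this example assumes a univariate scheme in the style of UMDA (independent inclusion probabilities re-estimated from the best masks), majority voting, and binary 0/1 labels. The names `vote_accuracy` and `eda_select`, their parameters, and the toy predictions are hypothetical.

```python
import random


def vote_accuracy(mask, preds, y):
    """Accuracy of the majority vote of the classifiers switched on in mask."""
    chosen = [p for keep, p in zip(mask, preds) if keep]
    if not chosen:
        return 0.0
    correct = 0
    for i, label in enumerate(y):
        ones = sum(p[i] for p in chosen)
        vote = 1 if 2 * ones > len(chosen) else 0  # ties count as class 0
        correct += vote == label
    return correct / len(y)


def eda_select(preds, y, rng, pop_size=40, n_elite=10, generations=30):
    """UMDA-style search over binary masks that mark which classifiers to keep."""
    n = len(preds)
    probs = [0.5] * n  # marginal inclusion probability of each classifier
    # Start from the full ensemble so the result is never worse than using "all".
    best_mask, best_fit = [1] * n, vote_accuracy([1] * n, preds, y)
    for _ in range(generations):
        population = [[int(rng.random() < p) for p in probs] for _ in range(pop_size)]
        population.sort(key=lambda m: vote_accuracy(m, preds, y), reverse=True)
        elite = population[:n_elite]
        top_fit = vote_accuracy(elite[0], preds, y)
        if top_fit > best_fit:
            best_mask, best_fit = elite[0], top_fit
        # Re-estimate each marginal from the elite, clamped to keep some exploration.
        probs = [min(0.95, max(0.05, sum(m[j] for m in elite) / n_elite)) for j in range(n)]
    return best_mask, best_fit


# Invented validation predictions: two useful classifiers and three that are always wrong.
y_val = [1, 1, 1, 0, 0, 0, 1, 0]
inverted = [1 - label for label in y_val]
one_error = y_val[:-1] + [1 - y_val[-1]]
preds = [y_val[:], one_error, inverted, inverted, inverted]

mask, fit = eda_select(preds, y_val, random.Random(1))
```

In this toy case the full committee is outvoted by its three poor members, while a committee restricted to the first classifier is perfectly accurate, which is the situation the "many could be better than all" idea targets.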

  • 【Online publication contributor】 University of Jinan
  • 【Online publication year and issue】 2009, Issue 09