基于DNA微阵列数据的癌症分类技术研究

The Research of Cancer Classification Based on DNA Microarray Data

【Author】 Yu Hualong (于化龙)

【Advisor】 Gu Guochang (顾国昌)

【Author Information】 Harbin Engineering University, Computer Application Technology, 2010, PhD

【Abstract】 With the Human Genome Project essentially complete, the life sciences have entered the post-genome era, in which the focus of research has shifted from individual genes to the functions and dynamics of the whole genome. This shift has created a demand for processing massive amounts of biological information. DNA microarray (gene chip) technology is one of the major hallmarks of the post-genome era and one of the main research areas in bioinformatics. It can measure the expression levels of tens of thousands of genes simultaneously, which greatly facilitates molecular-level diagnosis and subtyping of diseases, cancer in particular, as well as the study of disease mechanisms and the rapid development of new drugs. Because experiments are expensive, however, microarray datasets usually contain few samples, so the data are high-dimensional with small sample sizes. How to mine useful biological information from such data and use it to guide cancer detection and subtyping has become a pressing problem in machine learning and pattern recognition. This thesis studies the classification of cancer microarray data. Its main contributions are as follows:

(1) Wrapper feature gene selection methods typically suffer from two drawbacks: slow convergence and a tendency to become trapped in local optima. Two feature gene selection methods based on swarm intelligence are therefore proposed, one based on ant colony optimization and one based on an improved discrete particle swarm optimization. The former is simple to implement and quickly obtains a good solution, which addresses the slow convergence of existing methods. The latter adds one simple rule that lets the algorithm escape local optima, giving it stronger search ability.

(2) Existing selective ensemble classification methods usually have high time complexity. An ensemble classification method based on correlation analysis is therefore proposed. It moves the selection for diversity from the classifier level to the training-subset level, which reduces computational complexity while maintaining classification accuracy and saving storage, making the method more practical.

(3) A multiclass microarray data classification method based on confidence analysis is proposed. Test samples are first classified with one-versus-rest support vector machines, and the confidence of each classification result is then evaluated. Samples with low confidence are re-evaluated with a strategy called class priority estimation based on centroid distance. The method improves classification accuracy without a significant increase in computational complexity.

(4) Considering the small sample size of microarray datasets, an incremental cancer diagnosis method based on unlabeled samples is proposed. An initial diagnostic system is first trained on the existing labeled samples; in clinical use it classifies test samples and quantitatively estimates the confidence of each result. The confidence then determines whether a sample is referred to human medical experts for diagnosis with other detection methods. Finally, the newly labeled samples are added to the labeled sample set and the system is updated. The method maintains diagnostic accuracy while taking system utilization into account, and it allows the performance of the diagnostic system to improve incrementally. Compared with traditional approaches, it is more practical for clinical diagnosis.
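
The confidence-based idea in contribution (3) can be illustrated with a minimal Python sketch. The abstract does not define the confidence measure or the centroid-distance rule, so everything concrete here is an assumption made for illustration: confidence is taken as the margin between the two largest one-versus-rest decision values, the fallback is a plain nearest-centroid rule under Euclidean distance, and the names `centroids`, `classify_with_fallback`, and `margin_threshold` are hypothetical. The decision values are assumed to come from classifiers trained elsewhere.

```python
import math


def centroids(X, y):
    """Return the mean vector of each class in the labeled training set."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def classify_with_fallback(x, ovr_scores, class_centroids, margin_threshold=0.5):
    """Return (label, used_fallback) for one test sample x.

    ovr_scores maps each class to its one-versus-rest decision value for x
    (assumed to be produced by classifiers trained elsewhere).
    """
    ranked = sorted(ovr_scores, key=ovr_scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Assumed confidence measure: gap between the top two decision values.
    margin = ovr_scores[best] - ovr_scores[runner_up]
    if margin >= margin_threshold:
        return best, False
    # Low confidence: assumed fallback is the class with the nearest centroid.
    nearest = min(class_centroids, key=lambda c: math.dist(x, class_centroids[c]))
    return nearest, True
```

A sample whose top two decision values are close is handed to the centroid rule, while a clearly separated sample keeps its original label. The same confidence test could in principle decide which samples in contribution (4) are referred to a human expert, although the abstract does not say that the two methods share a measure.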
