Research on Gene Classification and Analysis Methods of Gene Expression Data (基因分类及基因表达数据分析方法的研究)

【Author】 Cai Lijun (蔡立军)

【Advisor】 Lin Yaping (林亚平)

【Author information】 Hunan University, Computer Application Technology, 2007, Ph.D.

【Abstract】 With the Human Genome Project essentially complete, life science has entered the Post-Genome Era. In this era the research focus has shifted from individual genes to the functions and dynamics of the whole genome, which creates a demand for processing massive amounts of biological information; the revolutionary development of computer technology has supplied the capability to meet that demand. Bioinformatics has therefore grown rapidly and successfully out of research in computational biology combined with the computer processing of biological information. It is the study of how to organize data, and how to extract new biological knowledge from them, at a time when computers and networks are developing rapidly and biological databases of all kinds are growing quickly.

Gene chip (microarray) technology is a recent breakthrough in experimental techniques for molecular biology. It allows the expression data of thousands of genes to be analyzed in parallel and generates a large quantity of useful data; analyzing and organizing these data has become a major bottleneck in using the technique. This thesis studies gene classification and methods for analyzing gene expression data. Its main work and contributions are as follows:

(1) The thesis reviews the development of gene classification, microarray technology, and commonly used classification algorithms, and evaluates their performance experimentally, providing a theoretical and experimental foundation for the subsequent chapters.

(2) Gene selection is an important problem in gene chip data analysis, mainly because the number of genes far exceeds the number of experimental samples. The thesis therefore introduces the Ant Colony Optimization algorithm (ACO algorithm) into gene selection and initializes the optimization problem with the values obtained from a correlation analysis between genes and classes, which shortens the search for the optimal solution. The objective function is a linear combination of the sample-discriminating ability of a gene subset as a whole and the mean distance between the genes it contains, which helps find the key genes while eliminating redundancy. Unlike typical wrapper-style gene selection algorithms, evaluating the objective function does not require computing the classification accuracy of every gene subset, which effectively reduces computational complexity and improves the flexibility and adaptability of the method.

(3) Independent Component Analysis (ICA) is a statistical method applied to gene classification. However, the separation matrix in ICA is mainly estimated with stochastic gradient and natural gradient algorithms, and these gradient-descent-based optimizers easily become trapped in local extrema, which gives imprecise results. The thesis proposes a gene classification algorithm based on a genetic algorithm. The basic idea is to use a genetic algorithm in place of the traditional separation-matrix estimation algorithms in ICA and then classify the gene expression data, which overcomes the imprecision. Experimental results show that the method achieves better classification results.

(4) The thesis studies the classification of gene expression data from two aspects: the classification algorithm and feature gene selection. It combines the traditional SVM and KNN algorithms into a new algorithm for classifying gene expression data. Because gene expression data sets have few samples and high dimensionality, it also proposes an improved correlation-based recursive feature elimination algorithm (C-RFE) that eliminates data redundancy. Experimental results show that the new method effectively improves classification accuracy and the efficiency of feature selection.

(5) A single classifier applied to gene classification has a limited range of applicability and low accuracy. Taking the characteristics of gene expression data into account, the thesis proposes a new gene classification algorithm that uses a multi-classifier combination model with a neural-network-based fusion rule, which overcomes the shortcomings of individual classifiers. Experiments show that the algorithm effectively improves classification accuracy and widens the range of applicability of the classifiers.

(6) Cluster analysis has become a very important method for analyzing gene expression data, but how to use higher-level biological knowledge to further analyze and interpret clustering results remains a pressing problem in functional genomics. The thesis proposes a simple algorithm that analyzes clustering results using GO annotations and KEGG regulatory and metabolic pathway annotations to obtain sets of co-expressed genes with significantly associated functional annotations. On that basis, the automatic analysis software SigClust was developed, and its predictive ability was validated on a set of gene expression data.
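As a rough illustration of item (2) of the abstract, the following Python sketch scores a candidate gene subset with a linear combination of its discriminating ability and the mean distance between its genes, and derives initial pheromone values from gene-class correlation. The abstract does not give the formulas, so every concrete choice here is an assumption made for illustration: absolute Pearson correlation as the relevance measure, `1 - |r|` as the gene-to-gene distance, the weight `alpha`, and the function names `relevance`, `initial_pheromone`, and `subset_score`. The ant colony search itself is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def relevance(expr, labels):
    """Assumed relevance measure: |correlation| of each gene (row) with the class labels."""
    return [abs(pearson(gene, labels)) for gene in expr]


def initial_pheromone(rel):
    """Assumed initialization: pheromone proportional to relevance, normalized to sum to 1."""
    total = sum(rel)
    if total == 0:
        return [1.0 / len(rel)] * len(rel)
    return [r / total for r in rel]


def subset_score(subset, expr, rel, alpha=0.5):
    """Linear combination of mean relevance and mean pairwise distance (1 - |r|).

    No classifier is trained here, which mirrors the abstract's point that
    the objective avoids computing classification accuracy per subset.
    """
    if not subset:
        raise ValueError("subset must contain at least one gene index")
    disc = sum(rel[i] for i in subset) / len(subset)
    pairs = list(combinations(subset, 2))
    if pairs:
        dist = sum(1 - abs(pearson(expr[i], expr[j])) for i, j in pairs) / len(pairs)
    else:
        dist = 0.0
    return alpha * disc + (1 - alpha) * dist


# Toy data: rows are genes, columns are samples; gene 1 is a scaled copy of gene 0.
expr = [
    [1, 2, 8, 9],
    [2, 4, 16, 18],
    [3, 1, 6, 5],
]
labels = [0, 0, 1, 1]
rel = relevance(expr, labels)
tau = initial_pheromone(rel)
redundant = subset_score([0, 1], expr, rel)
diverse = subset_score([0, 2], expr, rel)
```

On this toy data the subset `[0, 2]` scores higher than `[0, 1]`, because genes 0 and 1 are perfectly correlated and so contribute no distance term. This is the kind of redundancy penalty the abstract describes, under the assumed measures above.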
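Item (6) of the abstract describes finding functional annotations that are significantly associated with a cluster of co-expressed genes. The abstract does not say which statistic SigClust uses. The sketch below assumes a one-sided hypergeometric over-representation test, which is a common choice for this kind of annotation analysis, and should not be read as the thesis's actual algorithm. The function names, the `annotations` mapping, and the threshold `alpha` are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background, K: background genes carrying the annotation,
    n: genes in the cluster, k: cluster genes carrying the annotation.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def significant_terms(cluster, annotations, background, alpha=0.05):
    """Return (term, p-value) pairs with p < alpha, smallest p-value first.

    `annotations` maps a term identifier (for example a GO term or a KEGG
    pathway) to the set of genes annotated with it; this layout is an assumption.
    """
    cluster = set(cluster) & set(background)
    hits = []
    for term, genes in annotations.items():
        annotated = set(genes) & set(background)
        k = len(cluster & annotated)
        if k == 0:
            continue
        p = enrichment_p_value(len(background), len(annotated), len(cluster), k)
        if p < alpha:
            hits.append((term, p))
    return sorted(hits, key=lambda item: item[1])


# Toy example with made-up gene and term names.
background = {f"g{i}" for i in range(10)}
annotations = {
    "T1": {"g0", "g1", "g2"},
    "T2": {"g0", "g5", "g6", "g7", "g8"},
}
cluster = {"g0", "g1", "g2"}
result = significant_terms(cluster, annotations, background)
```

In practice one would correct the p-values for the number of terms tested before reporting a term as significant.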

  • 【Online publication contributor】 Hunan University
  • 【Online publication issue】 2008, Issue 06