Machine Learning Methods and Their Applications in Bioinformatics (机器学习方法及其在生物信息学领域中的应用)

【Author】 Wang Shuqin (王淑琴)

【Supervisor】 Liang Yanchun (梁艳春)

【Author Information】 Jilin University, Computer Application Technology, 2009, PhD

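【Chinese Abstract】 Bioinformatics is an interdisciplinary field that emerged in the late 1980s with the launch of the Human Genome Project. It is one of the major frontiers of the life sciences and natural sciences, formed at the intersection of biology, computer science, and applied mathematics. Bioinformatics methods make it possible to process large-scale data and extract the information needed, and thereby to better understand life and uncover the mysteries of the living world. As genome projects continue to be completed, the volume of data awaiting analysis and interpretation is growing exponentially. The sheer volume of data, the depth of the research, and the high complexity of genome data itself place urgent demands on the development of theory, algorithms, and software. Machine learning methods such as genetic algorithms and decision trees are well suited to fields like this, where data are abundant and noisy and a unified theory is lacking. This thesis studies machine learning methods and their applications in bioinformatics. The main work covers the following four aspects:

1. A method for constructing decision trees based on variable precision rough sets is proposed. The concepts of the variable precision explicit region and the variable precision implicit region are introduced, and a basic algorithm is given for selecting the branching attribute of a decision tree based on variable precision rough set theory. The method is tested on 19 data sets from the UCI Machine Learning Repository, and the results are compared with those of the widely used decision tree induction algorithm C4.5.

2. An operon prediction method based on a multi-approach guided genetic algorithm is proposed. Different methods are applied to evaluate different kinds of genome data so that the biological characteristics of each are fully exploited. A local-entropy-minimization method is proposed for evaluating intergenic distance. The experimental results show that prediction based on multiple attributes outperforms prediction based on a single attribute, and that the interval scores for intergenic distance obtained by local entropy minimization on E. coli can be used for operon prediction in other genomes.

3. An operon prediction method based on decision trees constructed with variable precision rough sets is proposed. Six kinds of genome data are used for operon prediction: intergenic distance, COG function, metabolic pathway, microarray expression data, phylogenetic profile, and conserved gene pairs. The method is tested on three genomes, E. coli, B. subtilis, and P. aeruginosa, and compared with C4.5. The experimental results show that it is an effective operon prediction method.

4. An improved k-TSP method for cancer classification based on information entropy is proposed. Information entropy is first used to select feature genes, and the k-TSP method is then used for cancer classification. Publicly available two-class gene expression data sets serve as the experimental data, leave-one-out cross-validation is used to compute the prediction accuracy, and the method compares favorably with seven other machine learning methods.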
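
Contribution 1 selects the branching attribute of a decision tree using the variable precision explicit region. The abstract does not define that region, so the sketch below assumes one common reading from variable precision rough set theory: a block of samples that share an attribute value belongs to the explicit region when its majority class covers at least a fraction `beta` of the block. The function names, the threshold `beta`, and the rule of picking the attribute with the largest explicit region are illustrative assumptions, not the algorithm given in the thesis.

```python
from collections import Counter, defaultdict


def vp_explicit_region_size(values, labels, beta=0.8):
    """Count samples lying in blocks that are at least `beta` pure.

    values: one attribute value per sample
    labels: one class label per sample
    beta:   assumed precision threshold, 0.5 < beta <= 1.0
    """
    # Group the class labels by attribute value (the equivalence blocks).
    blocks = defaultdict(list)
    for value, label in zip(values, labels):
        blocks[value].append(label)

    size = 0
    for block_labels in blocks.values():
        majority = Counter(block_labels).most_common(1)[0][1]
        if majority / len(block_labels) >= beta:
            size += len(block_labels)
    return size


def choose_split_attribute(rows, labels, beta=0.8):
    """Return the index of the attribute with the largest explicit region."""
    n_attrs = len(rows[0])
    sizes = [
        vp_explicit_region_size([row[j] for row in rows], labels, beta)
        for j in range(n_attrs)
    ]
    return max(range(n_attrs), key=sizes.__getitem__)


# Toy data: attribute 0 separates the classes except for one noisy sample.
rows = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y"),
        ("a", "x"), ("b", "y"), ("b", "x"), ("b", "y")]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
best = choose_split_attribute(rows, labels, beta=0.8)
```

With `beta=1.0` the function counts only perfectly pure blocks, which corresponds to the exact partition of classical rough sets. Lowering `beta` lets a block with a few noisy samples still count as certain, which is the noise tolerance that motivates the variable precision model.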

【Abstract】 Bioinformatics is an interdisciplinary subject that emerged with the start of the Human Genome Project at the end of the 1980s. It is one of the great frontiers of the life sciences and natural sciences, and it is expected to be one of the core fields of the natural sciences in the 21st century. It draws on several disciplines, including biology, computer science, and applied mathematics. Research in bioinformatics includes biological data collection and management, database search and sequence alignment, genome sequence analysis, gene expression data analysis and processing, protein structure prediction, and the construction of metabolic pathways, signaling pathways, and gene regulatory networks.

Bioinformatics methods can be used to process large-scale data and extract the necessary information, so that we can better understand living systems and reveal their mysteries. With the completion of genome sequencing projects, the amount of data to be analyzed and explained is increasing exponentially. The volume of data, the depth of the studies, and the complexity of genome data itself create an urgent need for new theories, algorithms, and software. Machine learning methods such as neural networks, genetic algorithms, decision trees, and support vector machines are well suited to fields in which the data are abundant and noisy and a unified theory is lacking.

In this thesis, we study machine learning methods and their applications in bioinformatics. The main work covers the following four aspects:

1. We present a new approach for inducing decision trees based on the Variable Precision Rough Set Model (VPRSM). Decision tree classification is a popular method in machine learning. Current methods of constructing decision trees are based on purity measures such as information entropy and the Gini index.
From the point of view of rough set theory, what these methods have in common is that they consider only the information in the implicit region and ignore the information in the explicit region. Rough set based approaches for inducing decision trees, by contrast, do consider the information in the explicit region: the more certain the information is, the better the results are. In real applications, however, data always contain noise. Methods based on rough sets partition the samples exactly, so they cannot prevent noise from affecting the construction of the decision tree. To reduce the classifier's sensitivity to noisy data and improve its generalization ability, we introduce variable precision rough set theory into the construction of the decision tree classifier and propose an approach for inducing decision trees based on the Variable Precision Rough Set Model. We propose two main concepts, the variable precision explicit region and the variable precision implicit region, and give an algorithm for inducing decision trees based on the model. We also report a comparison between the presented approach and C4.5 on data sets from the UCI Machine Learning Repository. The experimental results show that the proposed approach is superior to the classical decision tree algorithm C4.5, especially before pruning.

2. A novel multi-approach guided genetic algorithm for operon prediction is presented. Because the fuzzy rules used in Jacob's approach are based on intuition, they are difficult for non-specialists to create. Moreover, that approach uses the same method to assess every kind of genome data, so it cannot exploit the biological characteristics of each kind.
We therefore use different methods to preprocess different genome features so as to bring out their individual characteristics, and we use intergenic distance, participation in the same metabolic pathway, COG gene functions, and microarray expression data to predict operons. A novel local-entropy-minimization (LEM) method is proposed for evaluating intergenic distance: LEM divides the intergenic distances into several intervals and assigns a score to each interval. A COG function log-likelihood is computed for each adjacent gene pair, and the correlation coefficient of the microarray expression values is calculated. Finally, a genetic algorithm is used to fuse the four genome features and predict operons. The proposed method is evaluated on the Escherichia coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 genomes, where it obtains prediction accuracies of 85.9987%, 88.296%, and 81.2384%, respectively. The experimental results demonstrate that prediction using multiple features is better than prediction using a single feature. They also show that the intergenic distance intervals obtained with the local-entropy-minimization method on Escherichia coli can be used for operon prediction in other prokaryotic genomes.

3. We present an operon prediction method that uses a decision tree classifier based on the Variable Precision Rough Set Model. In addition to the intergenic distance, COG gene functions, metabolic pathway, and microarray expression data used in Chapter 4, we add two genome features: phylogenetic profiles and conserved gene pairs. We describe how to extract both. First, we use 360 genomes and the BLAST program to compute the phylogenetic profile of each gene and the conserved gene pair information of each gene pair. Then the Hamming distance between the phylogenetic profiles of the two genes in each adjacent gene pair is computed.
We give the frequency distribution and log-likelihoods for the different distances between phylogenetic profiles. Finally, we take these six genome features as the input to the proposed method. The method is evaluated on Escherichia coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 and is compared with C4.5. The experimental results show that it is an effective method of operon prediction.

4. An entropy-based improved k-TSP method (Ik-TSP) for classifying cancer is proposed. The k-TSP method proposed by Aik Choon Tan uses the top k highest-scoring gene pairs as its decision rule rather than only the single highest-scoring pair. It therefore needs to calculate the score of every gene pair and determine the decision rules from the scores of all gene pairs. Because each cancer data set is very large (the data sets used in this thesis contain at least 2,000 genes), the algorithm has relatively high time and space complexity. We therefore propose an entropy-based improved k-TSP method for classifying cancer: we use information entropy to select key genes and then use the k-TSP method to predict the cancer class. To evaluate the classification performance of Ik-TSP, we use as experimental data the 9 binary gene expression data sets used by Aik Choon Tan. Leave-one-out cross-validation (LOOCV) is employed to estimate the prediction accuracy. Compared with seven other existing machine learning methods, Ik-TSP obtains an average accuracy of 95.44%, a 3% improvement over the k-TSP method.

We have obtained some results on operon prediction and cancer prediction. These results enrich the study of applications of machine learning theory and provide a theoretical basis for applying operon prediction and cancer prediction. Operon prediction provides valuable information for the reconstruction of regulatory networks and for drug design.
Cancer prediction provides a new method for finding gene markers, which can promote the early diagnosis and treatment of cancer.
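
The k-TSP step in contribution 4 can be illustrated with a minimal sketch. It assumes the usual top-scoring-pair score for a gene pair (i, j), namely |P(x_i < x_j | class 0) − P(x_i < x_j | class 1)|, selects up to k highest-scoring pairs that share no gene, and classifies by unweighted majority vote. The function names and data layout are hypothetical, tie-breaking between equally scored pairs is simplified to first-come order, and the entropy-based gene selection of Ik-TSP is not shown because the abstract does not say how it is computed.

```python
from itertools import combinations


def tsp_score(xi, xj, labels):
    """Score one gene pair on binary-labelled samples (labels are 0 or 1).

    Returns (score, votes_zero), where votes_zero is True when the
    ordering xi < xj is more frequent in class 0 than in class 1.
    """
    probs = []
    for cls in (0, 1):
        idx = [s for s, y in enumerate(labels) if y == cls]
        probs.append(sum(xi[s] < xj[s] for s in idx) / len(idx))
    return abs(probs[0] - probs[1]), probs[0] > probs[1]


def top_k_pairs(expr, labels, k):
    """Pick up to k highest-scoring gene pairs with no gene used twice.

    expr: list of genes, each a list of expression values per sample.
    """
    scored = []
    for i, j in combinations(range(len(expr)), 2):
        score, votes_zero = tsp_score(expr[i], expr[j], labels)
        scored.append((score, i, j, votes_zero))
    scored.sort(key=lambda t: -t[0])  # stable sort: ties keep pair order

    chosen, used = [], set()
    for _, i, j, votes_zero in scored:
        if i not in used and j not in used:
            chosen.append((i, j, votes_zero))
            used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def predict(sample, pairs):
    """Classify one sample (list of gene values) by majority vote."""
    votes_for_zero = sum(
        (sample[i] < sample[j]) == votes_zero for i, j, votes_zero in pairs
    )
    return 0 if 2 * votes_for_zero > len(pairs) else 1


# Toy data: 4 genes, 6 samples, first three samples in class 0.
expr = [[1, 1, 1, 5, 5, 5],
        [3, 3, 3, 3, 3, 3],
        [2, 2, 2, 2, 2, 2],
        [9, 9, 9, 0, 0, 0]]
labels = [0, 0, 0, 1, 1, 1]
pairs = top_k_pairs(expr, labels, k=1)
```

Scoring every pair costs time quadratic in the number of genes, which is the cost the abstract points to for data sets with thousands of genes. Filtering genes before this step shrinks the list passed as `expr` and so reduces the number of pairs to score.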

  • 【Online Publication Contributor】 Jilin University
  • 【Online Publication Issue】 2009, Issue 08