基于集成学习算法的若干生物信息学问题研究

Research on Topics of Bioinformatics Employing Ensemble Learning Algorithm

【Author】 Niu Bing

【Supervisor】 Lu Wencong

【Author information】 Shanghai University, Materials Science, 2009, PhD

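【Abstract (translated from the Chinese)】 In the late 20th century, rapid advances in the genomics of humans and other species and in biological technology led to an astonishing growth in biological information. This greatly enriched the data resources of the life sciences and gave rise to a new interdisciplinary field, bioinformatics, which aims to reveal the biological meaning contained in experimental data by acquiring, processing, storing, retrieving, and analyzing it. Data mining, which discovers potentially useful knowledge in data, is playing an increasingly important role in bioinformatics research and has produced substantial results. This dissertation applies ensemble learning methods to several problems in bioinformatics. The main work consists of four parts:

1. Predicting protein structure and functional localization with ensemble learning algorithms. As biotechnology advances, more and more protein sequences are being determined, so it is important to explore theoretical and computational methods for studying protein structure and functional localization. Starting from the primary sequence, protein sequences were encoded by amino acid composition, and two ensemble learning algorithms, AdaBoost and Bagging, were used to predict protein structural class, membrane protein type, and protein subcellular location. During modeling, three weak learners (RandomForest, KNN, and C4.5) were used as base classifiers, and the modeling parameters were optimized using the results of 10-fold cross-validation. The results show that: (1) AdaBoost-RandomForest predicts protein structural class well, with leave-one-out accuracies of 94.18% and 85.9% on the two standard data sets used, better than previously reported results; (2) AdaBoost-C4.5 reaches leave-one-out accuracies of 91.80% and 80.80% for the subcellular location of prokaryotic and eukaryotic proteins, respectively, better than previously reported results; (3) Bagging-KNN reaches a leave-one-out accuracy of 84.42% for membrane protein type, better than previously reported results. Corresponding online prediction systems were developed from these models.

2. Studying the biological function of small molecules with ensemble learning algorithms. Understanding the biological function of small molecules helps explain life phenomena in molecular biology and disease mechanisms in medicine. Because discovering these functions experimentally consumes a great deal of labor, material, and money, and involves a degree of blind search and risk, studying the problem with ensemble learning is of practical value. We first studied the prediction of the metabolic pathway type of small molecules and proposed a small-molecule encoding based on functional group composition. An AdaBoost-C4.5 model reached a cross-validation accuracy of 74.05% and an accuracy of 75.11% on an independent test set. We then studied the prediction of small molecule-enzyme interactions. An AdaBoost-C4.5 model reached a cross-validation accuracy of 81.76% and an accuracy of 83.35% on an independent test set. The results show that ensemble learning algorithms can be used to study the biological function of small molecules and that the resulting models predict well. Online prediction systems were also developed from the metabolic pathway model and the small molecule-enzyme interaction model.

3. Predicting the toxicity mechanism of phenols with the AdaBoost ensemble learning algorithm. We collected 274 phenols from the literature, calculated 45 molecular descriptors, and selected 9 of them with the CFS (Correlation-based Feature Subset) algorithm, which is based on mutual information gain. Using these 9 descriptors, AdaBoost models were built with C4.5, RandomTree, RandomForest, and KNN as base classifiers; after optimization and validation, C4.5 was chosen as the base classifier. Finally, the predictive performance was compared with that of SVM and KNN. The results show that AdaBoost predicts the toxicity mechanism of phenols well, with accuracies of 96.3% in cross-validation and 92.8% on an independent test set. A corresponding online prediction system was built from this work.

4. Predicting HIV-1 protease cleavage sites with an mRMR-KNN ensemble method. First, octapeptides were encoded with the 531 amino acid residue indices in AAindex, and the mRMR feature selection method was then used to obtain 500 features. On this basis, an improved Wrapper search yielded a subset of 364 features. Finally, a nearest-neighbor (KNN) model was built to predict HIV-1 protease cleavage sites, with accuracies of 91.3% in the leave-one-out test and 87.3% on an independent test set. A biological analysis of the 500 features showed that: (1) the P1 and P2' positions contribute most to the substrate specificity of HIV-1 protease; (2) residues at the P1 position are mainly hydrophobic, whereas residues at the P2' position are mainly determined by secondary structure. Both conclusions agree with earlier experimental findings in the literature. These results show that mRMR combined with the improved Wrapper method can select features effectively from biological data sets, and that models built on the selected features not only predict satisfactorily but also rest on biologically meaningful features. mRMR is therefore expected to become an important feature selection method in bioinformatics.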
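
AdaBoost, the boosting scheme used in parts 1 to 3 above, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the base learners there were C4.5, RandomForest, and KNN on multi-class problems, whereas the sketch below swaps in one-feature decision stumps and assumes binary labels in {-1, +1}. All function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Predict `sign` when the feature is <= thr, else `-sign`.
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign


def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        if err >= 0.5:  # no better than chance: stop boosting
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a toy one-dimensional problem such as `X = [[1], [2], [3], [4], [5], [6]]` with `y = [1, 1, -1, -1, 1, 1]`, no single stump separates the classes, but three boosting rounds combine stumps into a vote that does, which is the effect the ensemble approach relies on.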

【Abstract】 In the late 20th century, rapid advances in bioscience techniques, human genomics, and the genomics of other species caused biological information to grow at a surprising speed. This greatly enriched biological data resources and led to the birth of bioinformatics. In bioinformatics, researchers try to uncover the biological meaning contained in data by capturing, managing, storing, retrieving, and analyzing biological information. Data mining technology is used to extract potentially useful information from databases and is playing an increasingly important role in bioinformatics. In this dissertation, ensemble learning methods were used to investigate several topics in bioinformatics. The main work consists of the following four parts:

1. Using ensemble learning algorithms to predict protein structure and function types. With the success of the Human Genome Project, the number of protein sequences entering the data banks is increasing rapidly. The structures and functions of these proteins could in principle be determined experimentally, but doing so for every sequence is very time-consuming and practically infeasible. Scientists have therefore been seeking theoretical or computational methods for predicting the structures and functions of proteins. In this dissertation, AdaBoost and Bagging were employed to predict protein structural classes and functional locations based on the amino acid composition of the sequence. During modeling, three weak learners (RandomForest, KNN, and C4.5) were used as base classifiers, and the modeling parameters were optimized based on the cross-validation results of the models.
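
The amino acid composition encoding mentioned above turns a sequence of any length into a fixed 20-dimensional vector of residue fractions. A minimal sketch follows; the function name and the choice to skip non-standard letters are assumptions of this example, not details given in the abstract.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes) in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Letters outside the 20 standard codes (e.g. X, B, Z) are ignored;
    this is an assumption of the sketch.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Missing residues have a Counter value of 0, so every slot is filled.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Because the vector has the same length for every protein, it can be passed directly to classifiers such as the boosted or bagged models described here; the cost is that all information about residue order is discarded.
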
The results show that: (1) with AdaBoost-RandomForest, the best models reached prediction accuracies of 94.18% and 85.90% in leave-one-out cross-validation on two standard data sets of protein structural classes; (2) with AdaBoost-C4.5, the best models reached prediction accuracies of 91.80% and 80.80% in leave-one-out cross-validation for the subcellular location of prokaryotic and eukaryotic proteins, respectively; (3) with Bagging-KNN, the best model reached a correct rate of 84.42% in leave-one-out cross-validation for membrane protein types. All prediction accuracies obtained with the ensemble learning methods are better than previously reported results. Based on the models for predicting subcellular location and membrane protein type, two corresponding online web servers were established.

2. Using ensemble learning algorithms to predict the metabolic pathways of small molecules and small molecule-enzyme interactions. First, based on AdaBoost and features built from functional group composition, a novel approach is proposed to quickly map small chemical molecules to the metabolic pathways they may belong to. The model reached accuracies of 74.05% in 10-fold cross-validation and 75.11% on an independent test set. Second, building on this work, amino acid physicochemical properties were used to encode enzymes, resulting in 160 features in total. These features were fed into an AdaBoost classifier to predict whether a small molecule and an enzyme interact. The overall prediction accuracies were 81.76% in 10-fold cross-validation and 83.35% on an independent test set. Based on the models for predicting the metabolic pathways of small molecules and small molecule-enzyme interactions, two corresponding online web servers were built.

3. AdaBoost was employed to investigate the toxic action mechanisms of phenols based on molecular descriptors.
A total of 274 phenols were collected from the literature, and 45 descriptors were calculated. First, 9 descriptors were selected using the CFS (Correlation-based Feature Subset) method. Then C4.5, RandomTree, RandomForest, and K-nearest neighbors (KNN) were tried as base classifiers of AdaBoost, and C4.5 was selected. Finally, the performance of AdaBoost was compared with that of support vector machine (SVM) and KNN, which are among the most common algorithms used for structure-activity relationship (SAR) analysis. AdaBoost performed better than SVM and KNN in predicting the mechanism of toxicity of phenols from molecular descriptors. It can be concluded that AdaBoost has the potential to improve the performance of SAR analysis. We also developed an online web server for predicting the ecotoxicity mechanisms of phenols.

4. Knowledge of the polyprotein cleavage sites of HIV protease will refine our understanding of its specificity, and the information thus acquired is useful for designing specific and efficient HIV protease inhibitors. Recently, a number of methods for creating and combining classifiers have been proposed to address the HIV-1 protease specificity problem. The search for proper inhibitors of HIV protease will be greatly expedited if an accurate, robust, and rapid method can be found for predicting the sites in proteins cleaved by HIV protease. In this work, HIV-1 protease was selected as the subject of study. A total of 299 oligopeptides were chosen as the training set, and another 63 oligopeptides were taken as the test set. The peptides are represented by features constructed from AAindex. The mRMR (Maximum Relevance, Minimum Redundancy) method, combined with Incremental Feature Selection (IFS) and Feature Forward Search (FFS), was applied to identify the 2 most important positions around the cleavage site and to select 364 important biochemical features, evaluated by the jackknife test.
Using KNN (K-nearest neighbors) with the selected features, the prediction model reached accuracies of 91.3% in the jackknife cross-validation test and 87.3% on the independent test set. We expect this feature selection scheme to serve as a useful auxiliary technique in the search for effective inhibitors of HIV protease.
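
The jackknife (leave-one-out) test reported for the KNN model can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, majority voting, and the function names are choices made for the example, and the abstract does not specify the distance measure or the value of k that was used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jackknife_accuracy(X, y, k=1):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        # Hold out sample i and use the remaining samples as training data.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            hits += 1
    return hits / len(X)
```

For example, `jackknife_accuracy([[0.0], [1.0], [10.0]], ["a", "a", "b"])` holds out each of the three points in turn; the isolated point at 10.0 is assigned the label of its nearest remaining neighbor, so it is the one sample predicted incorrectly.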

  • 【Online publication contributor】 Shanghai University
  • 【Online publication year/issue】 2010, Issue 05