基于支持向量机的蛋白质结构域预测方法研究

Research on Prediction of Protein Domains Based on Support Vector Machines

【Author】 Zou Shuxue (邹淑雪)

【Advisor】 Zhou Chunguang (周春光)

【Author information】 Jilin University, Computer Application Technology, 2009, PhD

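【Abstract (Chinese version)】 Bioinformatics is a new interdisciplinary field that emerged with the launch of the Human Genome Project; it is the science of storing, retrieving, and analyzing biological information with the computer as its tool. With the completion of the Human Genome Project, the life sciences entered the post-genomic era, and the research focus shifted mainly to genomics and proteomics. Proteomics studies the presence and activity of all proteins in a cell, and the traditional approach of studying proteins one at a time can no longer meet the demands of the post-genomic era. Bioinformatics will therefore become increasingly important in resolving higher-order protein structure. The first step in analyzing a protein is to determine its domain composition, which is the most important step in studying the protein. Detecting protein domains is a challenging problem, and domain analysis directly from sequence information alone has gradually become the main research goal of domain prediction. This thesis studies in depth the problem of detecting domain boundary signals from protein sequence information.
1. Several methods are defined for extracting features from multiple sequence alignment results. The conformational entropy of the seed sequence is computed from the conformational characteristics of the protein, and information entropy theory is used to maximize the domain information. Finally, a support vector machine learning system classifies the extracted feature values.
2. After investigating why the support vector machine parameters are insensitive to domain boundary signals, the thesis is the first to formulate protein domain boundary detection as an imbalanced-data learning problem: domain interiors form the majority negative class and domain boundaries form the minority positive class. A new undersampling method is proposed that, in the feature space of the support vector machine, samples the negative examples that have the maximal distance-based entropy value with respect to the positive examples.
3. Before support vector machine learning, the training set is sampled with the genetic-algorithm-based method proposed in this thesis. To evaluate more effectively the classifier trained on the sampled data, AUC (Area Under the ROC Curve) is adopted as the classifier performance measure and used as the fitness function of the genetic algorithm. Experimental results show that the proposed sampling technique is clearly better than random sampling and that, in protein domain prediction, it clearly outperforms a support vector machine classifier used alone.
4. Building on the theoretical proof of the equivalence between support vector machines and fuzzy classification systems, a fuzzy classification system model based on support vector machines is proposed. The SVM learning algorithm is first used to obtain a sparse representation of the classification system; the resulting system is then mapped to an equivalent positive definite fuzzy classification system; and the fuzzy rule base is reduced and optimized using the closeness degree of fuzzy sets and particle swarm optimization. The fuzzy classification system has better generalization ability; its learning process is equivalent to optimizing the SVM system parameters, but training is faster.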
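The AUC criterion named in the abstract can be illustrated with a minimal sketch. It uses the standard rank identity (AUC equals the fraction of positive/negative pairs that the classifier orders correctly, with ties counted as one half). The function name `auc` and the 1/0 label encoding are assumptions made for this illustration, not details taken from the thesis.

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise-ranking identity.

    scores: classifier outputs, larger means "more likely positive".
    labels: 1 for the positive class (e.g. domain boundary),
            0 for the negative class (e.g. domain interior).
    The encoding is an assumption of this sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    # Count positive/negative pairs ranked correctly; a tie counts one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Example: one of the four positive/negative pairs is ordered wrongly.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because the value depends only on how positives rank against negatives, it is not inflated by a large majority class, which is why it suits an imbalanced problem better than plain accuracy. The quadratic loop is kept for clarity; a sort-based version would be preferable for large datasets.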

【Abstract】 Since proteins provide some of the most fundamental information about many processes in almost all organisms, the ability to predict protein structure and function has become one of the most important goals in bioinformatics research. Protein domains represent one of the most useful avenues for understanding protein function and for domain family-based analysis, and are of great importance in the study of individual proteins. Detecting the domain structure of a protein is a challenging problem: for a given protein sequence, one must determine which amino acids lie inside a domain and which lie on a domain boundary. The problem comes in two versions. One is to locate the domains and boundaries in a given protein structure. The other is to do the same for a sequence whose structure is unknown. The latter is the more difficult.

Support vector machines (SVMs) are a new statistical learning technique that can be seen as a method for training classifiers based on polynomial functions, radial basis functions, neural networks, splines, or other functions. An SVM uses a linear separating hyperplane to create a classifier. For problems that cannot be linearly separated in the input space, the SVM can find a solution by applying a non-linear transformation from the original input space into a high-dimensional feature space, where an optimal separating hyperplane can be found. Although SVMs have been extensively studied and have shown remarkable success in many applications, their performance drops significantly on imbalanced datasets. Moreover, this drop is difficult to avoid when one tries to improve the effectiveness of SVMs on imbalanced datasets by modifying the algorithm alone.
Therefore, sampling, as a data preprocessing step, is a popular strategy for handling the class imbalance problem, since it rebalances the dataset directly.

This thesis presents an intensive study of domain boundary detection using only a given protein sequence. A promising method for detecting the domain structure of a protein from sequence information alone is presented. Given a query sequence, our algorithm starts by searching the protein sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain-information content of alignment columns, such as conservation measures on the composition and classification of amino acids in each column, consistency and correlation measures, and measures of structural flexibility. Information-theoretic principles are employed to maximize the information content. In addition, we adopt an existing method for predicting domain boundaries from the protein sequence alone. The method is based on the theory that a protein's unique three-dimensional structure results from the balance between the gain of attractive native interactions and the loss of conformational entropy. These scores are then combined using a support vector machine to label single columns as core-domain or boundary positions. The overall accuracy of the method on a dataset of single protein chains is about 85%.

A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. The SVM's learning mechanism makes it an interesting candidate for dealing with imbalanced datasets, since an SVM takes into account only the data close to the boundary, i.e. the support vectors, when building its model. More importantly, because SVMs are kernel-based methods, their classification is defined in the feature space, and so is our undersampling preprocessing.
Consequently, negatives that are very close to or very far from a given positive are not sampled. Negatives too close to the learned hyperplane may skew it, while those far from it cannot become support vectors and contribute nothing to training. Negatives whose distance is close to the mean distance, by contrast, contribute the most. The negatives that have the maximal entropy value with respect to their counterpart positives are the ones selected by undersampling, so that the input data are no longer imbalanced. The learned hyperplane thus lies further from the positive class, which compensates for the skew of imbalanced datasets that pushes the hyperplane closer to the positive class.

Given a query sequence, our algorithm starts by searching the local sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain-information content of alignment columns. Information-theoretic principles are employed to maximize the information content. In addition, we extract a feature from the conformational entropy of the protein sequence. This yields an imbalanced training dataset. Next, we resample the dataset to form the initial population of N individuals for a genetic algorithm. We test two sampling techniques separately: over-sampling the minority class and under-sampling the majority class. An SVM is trained on each resampled training set and the corresponding AUC value is computed. The population is updated by three basic genetic operators (reproduction, crossover, and mutation) according to the AUC fitness value. SVM learning and population updating are iterated until convergence or until the maximum number of iterations is reached.
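The idea of undersampling by distance in the SVM feature space can be sketched as follows. The abstract does not give the entropy formula, so this sketch swaps in a simpler stand-in rule: it keeps the negatives whose average feature-space distance to the positives is closest to the mean distance, which the abstract identifies as the most useful ones. The RBF kernel, the `gamma` value, and all function names are assumptions for illustration only. The one fact relied on is the kernel identity ||φ(x) − φ(y)||² = k(x,x) − 2k(x,y) + k(y,y).

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative choice."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y) computed from kernel values only."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors


def undersample_negatives(positives, negatives, n_keep, kernel=rbf_kernel):
    """Keep the n_keep negatives whose distance to the positive class is
    closest to the mean distance (a stand-in for the entropy criterion)."""
    # Average feature-space distance from each negative to all positives.
    avg = [
        sum(feature_space_distance(n, p, kernel) for p in positives)
        / len(positives)
        for n in negatives
    ]
    mean_dist = sum(avg) / len(avg)
    # Rank negatives by deviation from the mean distance, keep the closest.
    order = sorted(range(len(negatives)), key=lambda i: abs(avg[i] - mean_dist))
    keep = sorted(order[:n_keep])  # preserve the original ordering
    return [negatives[i] for i in keep]


# Toy data: one positive and three negatives at increasing distance.
toy_pos = [(0.0, 0.0)]
toy_neg = [(0.1, 0.0), (1.0, 0.0), (5.0, 0.0)]
kept = undersample_negatives(toy_pos, toy_neg, n_keep=1)
```

On the toy data the very near negative and the very far negative are dropped and the mid-distance one is kept, which mirrors the behaviour described in the text. A faithful implementation would replace the ranking key with the thesis's entropy measure.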
A fuzzy classification system model based on support vector machines is proposed in this thesis. As a powerful tool for dealing with complex uncertainty problems, fuzzy system theories (L.A. Zadeh et al.) have succeeded in many applications such as signal processing and pattern recognition. However, they often suffer from the curse of dimensionality on high-dimensional data. SVMs and fuzzy systems are complementary in such cases. Earlier work proved an equivalence between SVMs and positive definite fuzzy classifiers, which makes it possible to combine SVMs with fuzzy systems. Reduction methods are developed to minimize the complexity of the system by reducing the linguistic terms in the fuzzy rules, based on the similarity of fuzzy sets, and by removing redundant and inconsistent fuzzy rules. Finally, particle swarm optimization is used to adjust the system parameters to compensate for the deviation caused by the reduction. Experimental results show that the methods are feasible and effective.
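The equivalence mentioned above can be made concrete for one common case, shown here as a hedged sketch rather than the thesis's construction. With a Gaussian kernel, the kernel value factorizes into a product of one-dimensional Gaussian membership functions, so each support vector can be read as one fuzzy rule ("IF x_1 is near sv_1 AND x_2 is near sv_2 ... THEN contribute coeff"). The support vectors, coefficients, bias, and `gamma` below are hypothetical values, not results from the thesis.

```python
import math


def rbf_kernel(x, sv, gamma):
    """Gaussian kernel between an input and a support vector."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, sv)))


def svm_decision(x, support_vectors, coeffs, bias, gamma):
    """f(x) = sum_i coeff_i * K(x, sv_i) + b, with coeff_i = alpha_i * y_i."""
    return bias + sum(
        c * rbf_kernel(x, sv, gamma) for sv, c in zip(support_vectors, coeffs)
    )


def gaussian_membership(value, center, gamma):
    """One-dimensional fuzzy set 'value is near center'."""
    return math.exp(-gamma * (value - center) ** 2)


def fuzzy_decision(x, support_vectors, coeffs, bias, gamma):
    """Same function written as a rule base: one rule per support vector,
    rule antecedents combined with the product t-norm."""
    total = bias
    for sv, c in zip(support_vectors, coeffs):
        firing = 1.0
        for value, center in zip(x, sv):
            firing *= gaussian_membership(value, center, gamma)
        total += c * firing  # the rule consequent is the signed coefficient
    return total


# Hypothetical model: two support vectors in two dimensions.
svs = [(0.0, 1.0), (2.0, -1.0)]
coeffs = [0.8, -0.6]
bias, gamma = 0.1, 0.5
x = (0.5, 0.5)
f_svm = svm_decision(x, svs, coeffs, bias, gamma)
f_fuzzy = fuzzy_decision(x, svs, coeffs, bias, gamma)
```

The two values agree up to floating-point rounding, so the sign of either gives the same class. This view also explains why rule reduction matters: every support vector adds a rule, and merging similar membership functions or dropping redundant rules shrinks the rule base at the cost of a deviation that must then be compensated.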

  • 【Online publication contributor】 Jilin University
  • 【Online publication year/issue】 2009, Issue 09