外膜蛋白序列和结构辨识相关问题研究
Research on Relevant Problems of Discriminating Sequences and Structures of Outer Membrane Proteins
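【Author】 邹凌云 (Zou Lingyun);
【Advisor】 王正志 (Wang Zhengzhi);
【Author information】 国防科学技术大学 (National University of Defense Technology), Control Science and Engineering, 2008, Ph.D.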
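【Summary】 Proteomics is one of the main research fields of bioinformatics. Membrane proteins, which are widely exploited as drug targets, are an important subject of proteomics research. Outer membrane proteins, a family of membrane proteins, are located in the outer membranes of Gram-negative bacteria, chloroplasts and mitochondria, fold into barrel-shaped transmembrane structures, and form one of the two main classes of transmembrane proteins. Outer membrane proteins are closely related to the pathogenicity of Gram-negative bacteria and to immune function, which makes them highly valuable drug targets, and they take part in physicochemical processes such as non-specific regulation, substance transport and the formation of selective ion channels. Taking the bioinformatics of outer membrane proteins as its theme, this dissertation improves and extends protein sequence encoding methods, classification algorithms and structure prediction models in order to raise the accuracy of discriminating outer membrane protein sequences and structures, and to solve some related problems. The main research contents and contributions are as follows:
(1) Methods for outer membrane protein sequence discrimination and genome mining. Methods for discriminating outer membrane proteins from other protein folding types are studied with two main aims: mining new outer membrane proteins and their coding genes in genomes, and accumulating new data for sequence analysis and structure prediction. Using the theory of measures of diversity, the dissertation proposes a discrimination method based on the minimum increment of diversity, and further improves it into a weighted voting method over the predictions of multiple increments of diversity. The method gives protein sequence discrimination a new tool that is easy to implement and easy to extend to multiclass problems. In addition, to meet the needs of mining outer membrane proteins in genomes, several combined feature encoding schemes for protein sequences are proposed. Weighted amino acid index correlation coefficient features are introduced into the combined features, and the selected encoding scheme is coupled with the support vector machine classification algorithm to build a classifier. Both in tests on datasets and in mining within genomes, the method reaches the best prediction performance reported so far and becomes an effective tool for mining outer membrane proteins. Feature selection techniques are also used to analyze the optimization of the high-dimensional combined features: a filter method selects an effective feature subset, which speeds up computation and even improves prediction.
(2) Multiclass protein classification algorithms. The support vector machine is a machine learning technique with excellent generalization performance, but it does not handle multiclass classification well and suffers from drawbacks such as unclassifiable (blind) regions and error accumulation. Fuzzy support vector machines offer a new means of remedying these drawbacks. The dissertation adopts a fuzzy membership calculation based on sample affinity, computes for each sample two errors, one as a positive example and one as a negative example, reconstructs the optimal separating hyperplane of the support vector machine, and builds bidirectional fuzzy classifiers in "one-versus-one" and directed graph form. On membrane protein classification problems, the algorithm is less sensitive to isolated points and noise points and improves classification to some extent. It is a new development of fuzzy multiclass support vector machines.
(3) Combined prediction of signal peptides and topology of outer membrane proteins. Topology prediction of transmembrane proteins matters in two ways: it provides a model framework for inferring tertiary structure from secondary structure, and it helps to refine secondary and tertiary structures. Existing outer membrane protein topology predictors, when applied to precursor sequences, do not predict signal peptides, and their topology prediction performance drops because of the signal peptide. Applying hidden Markov model theory, the dissertation builds a combined prediction model of signal peptides and topology for outer membrane protein precursor sequences, in which the signal peptide becomes part of the topology, and optimizes the model architecture using the latest knowledge. The model has the best outer membrane protein topology prediction performance reported so far and serves as a convenient tool that integrates signal peptide cleavage site prediction, topology prediction and sequence discrimination.
(4) Subcellular localization prediction for transmembrane proteins. Most existing protein subcellular localization predictors are designed around the properties of water-soluble proteins and cannot effectively predict the subcellular location of transmembrane proteins. Topology predictors based on hidden Markov models use transmembrane topology information but do not provide subcellular localization prediction. The dissertation adapts a transmembrane protein topology prediction model into a subcellular localization prediction tool. When predicting the subcellular location of transmembrane proteins on the cell secretory pathway, it performs significantly better than general-purpose prediction methods, fills the gap in subcellular localization prediction for transmembrane proteins, and opens up a new application for topology predictors.
(5) Prediction of small non-coding RNAs that regulate outer membrane proteins. Prediction of small non-coding RNAs is a difficult problem of great biological value. There is as yet no method dedicated to predicting the small non-coding RNAs that regulate a particular class of proteins. The dissertation proposes a principal component analysis-neural network prediction model. By removing feature correlation and reducing feature dimensionality through principal component analysis, the model improves the performance of the neural network predictor and becomes an effective tool for identifying bacterial small non-coding RNAs. Furthermore, since base pairing is the main mode of interaction between small non-coding RNAs and outer membrane protein mRNAs, a two-stage screening system is designed to predict the small non-coding RNAs that regulate outer membrane proteins. The system uses a base pairing scoring function to search the genome for non-coding regions that pair with known outer membrane protein mRNAs with high scores, and then uses the principal component analysis-neural network model to filter out most of the redundancy in the search results. Its advantage is that it lowers the cost of experimental screening and provides experimental candidates with little redundancy.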
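The minimum increment of diversity rule in item (1) can be illustrated with a minimal sketch. It assumes the usual definitions from the measure-of-diversity literature, D(X) = N log N - sum(n_i log n_i) with N = sum(n_i), and ID(X, Y) = D(X + Y) - D(X) - D(Y). The use of plain amino acid counts as the feature source, the toy sequences and the function names are assumptions made for illustration, not the feature sets or data used in the dissertation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_counts(seq):
    """Count each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq)
    return [counts[a] for a in AMINO_ACIDS]

def diversity(counts):
    """Measure of diversity D(X) = N log N - sum(n_i log n_i)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small when X and Y are similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)

def classify(seq, class_sources):
    """Assign the class whose reference counts give the minimum ID."""
    query = composition_counts(seq)
    return min(class_sources,
               key=lambda label: increment_of_diversity(query, class_sources[label]))

# Toy, made-up reference sequences (hypothetical, not real OMP data).
sources = {
    "omp": composition_counts("GSTYGSTYNNQ"),
    "other": composition_counts("LLLAAAVVIIL"),
}
print(classify("GTYSNQ", sources))
```

In practice each class source would be accumulated from many training sequences, and the weighted voting variant would combine the decisions obtained from several such feature sources.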
【Abstract】 Proteomics is one of the main fields of bioinformatics research. Membrane proteins hold a prominent place in proteomics because of their importance as drug targets for disease treatment and as the main functional components of biomembranes. Outer membrane proteins (OMPs), a special family of membrane proteins, reside in the outer membranes of Gram-negative bacteria, chloroplasts and mitochondria. Most of them fold into beta-barrel structures of 8-22 beta-strands, and together with alpha-helical membrane proteins they make up the two types of transmembrane proteins. OMPs perform a variety of functions, such as mediating non-specific, passive transport of ions and small molecules and selectively allowing the passage of molecules, and they are involved in voltage-dependent anion channels. OMPs also play a role in bacterial adhesion, toxin release and immunity, which makes them valuable drug targets against Gram-negative bacteria. Discriminating the sequences and structures of OMPs remains a challenge because OMPs are difficult to validate experimentally and to resolve structurally, and various computational approaches are emerging to address these problems. Focusing on the bioinformatics of OMPs, this dissertation studies protein sequence encoding, classification algorithms and new prediction models, with the aim of improving the accuracy of OMP discrimination and solving other related problems. The main contents and contributions of the dissertation are summarized as follows:
(1) New approaches for discriminating OMPs from other protein folding types and for mining OMPs in genomes. OMP discrimination methods have two main applications: the first is mining new OMPs and their corresponding genes in genomes; the second is accumulating new data for predicting the secondary and tertiary structures of OMPs.
Two new approaches to OMP discrimination have been developed in this research. The first is a prediction method based on the theory of measures of diversity from biomathematics, in which the increment of diversity measures the difference between OMPs and other proteins. This method is easy to implement and to extend to multiclass protein classification. The second builds on combined sequence features and the support vector machine (SVM) algorithm. A protein sequence is encoded by a combined feature encoding scheme that joins weighted amino acid index correlation coefficients with amino acid composition and dipeptide composition. This method performs better than existing methods in the literature for discriminating OMPs and provides an effective tool for mining new OMPs in genomes. Furthermore, feature selection techniques are studied to improve the combined feature encoding scheme. A filter method is presented that selects the most effective features among the combined features, which speeds up classification and can even improve prediction performance.
(2) Algorithms for multiclass protein classification problems. SVMs often perform better than other machine learning techniques in binary classification, but some problems remain unsolved for multiclass SVMs, such as unclassifiable (blind) regions and error accumulation. Several fuzzy SVM algorithms have therefore been introduced in the literature to improve multiclass SVMs. This research presents a bidirectional fuzzy SVM algorithm that treats each sample both as a positive sample and as a negative sample, so that a sample contributes two errors, one from being positive and one from being negative. Further, the fuzzy membership is defined not only by the relation between a sample and its cluster center but also by the relations among samples, which are described by the fuzzy connectedness among samples.
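As a minimal sketch of the first ingredient of such a membership, the relation between a sample and its cluster center, the following uses a distance-based rule that is common in the fuzzy SVM literature: membership = 1 - d / (r + delta), where d is the distance to the class center and r the class radius. The function name, the `delta` constant and the toy points are assumptions for illustration. The connectedness among samples and the bidirectional error terms of the dissertation's algorithm are not reproduced here.

```python
import math

def center_distance_membership(samples, delta=1e-6):
    """Fuzzy membership of each sample in its own class.

    Samples near the class center get values close to 1, and the
    farthest sample (a likely outlier) gets a value close to 0.
    `delta` is a small assumed constant that keeps memberships positive.
    """
    dim = len(samples[0])
    center = [sum(s[k] for s in samples) / len(samples) for k in range(dim)]
    dists = [math.dist(s, center) for s in samples]
    radius = max(dists)
    return [1.0 - d / (radius + delta) for d in dists]

# Toy class with one obvious outlier at (5, 5).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
for point, m in zip(points, center_distance_membership(points)):
    print(point, round(m, 3))
```

Memberships of this kind are typically used to down-weight the slack penalty of each training sample, so that outliers and noisy points pull less on the separating hyperplane.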
The bidirectional fuzzy SVM algorithm is implemented in "one-vs-one" or directed graph frameworks. In tests on membrane protein classification, it is less sensitive to outliers and noise and outperforms traditional "one-vs-rest" and "one-vs-one" multiclass SVMs.
(3) Methods for combined prediction of signal peptides and topologies of OMPs. Topology prediction of transmembrane proteins contributes in two ways: first, it offers a framework for investigating the tertiary structures of OMPs from their secondary structures; second, it helps to revise structural predictions of OMPs. However, existing topology predictors cannot predict the signal peptides of OMP precursors, and their accuracy declines under the influence of signal peptide sequences. In this research, a predictor based on hidden Markov models is developed for combined prediction of signal peptides and topologies of OMPs. In the model, the signal peptide is treated as part of the whole topology of an OMP precursor, and the architecture is optimized to fit the natural structure of OMPs. This model performs better than other models for topology prediction and can further be applied to signal peptide prediction and to discrimination of OMPs in genomes.
(4) Methods for subcellular localization prediction of transmembrane proteins. Existing methods for protein subcellular localization prediction are mainly designed for soluble proteins and are usually not accurate for transmembrane proteins. Topology predictors, on the other hand, are designed for transmembrane proteins but do not provide subcellular localization prediction. This research describes a new approach to predicting the subcellular localization of transmembrane proteins, which adapts existing topology predictors and gives better accuracy than existing methods.
It is the only approach dedicated to subcellular localization prediction of transmembrane proteins, and it is also a new application of topology predictors.
(5) Methods for recognizing small non-coding RNAs involved in OMP regulation. Prediction of regulatory small non-coding RNAs (sRNAs) is a difficult problem of great biological value. No approach has yet been presented for predicting the sRNAs that regulate a given protein type. This research describes a method for predicting bacterial sRNAs, in which principal component analysis (PCA) reduces the dimensionality and removes the correlation of sRNA sequence features, and a back-propagation (BP) neural network (NN) is constructed for classification. This PCA-NN classifier can effectively predict bacterial sRNAs and is therefore adopted in a two-phase filtering system for predicting sRNA regulators of OMPs. In the first phase, the system searches non-coding regions for sRNA candidates by scoring base pairing between OMP mRNAs and genomic non-coding regions; in the second phase, it filters redundant candidates using the PCA-NN classifier. The prediction system provides fewer redundant candidates for experiments than general methods.
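The first phase of such a two-phase system can be sketched as follows. This is only an illustration under stated assumptions: the pair scores (G-C 3, A-U 2, G-U wobble 1), the mismatch penalty, the ungapped fixed-length window and all function names are hypothetical and are not the scoring function of the dissertation. The second phase, the PCA-NN classifier, is not shown.

```python
# Hypothetical pair scores: Watson-Crick pairs and the G-U wobble pair.
PAIR_SCORES = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}
MISMATCH = -2  # assumed penalty for a non-pairing position

def best_duplex_score(mrna, candidate, window=8):
    """Best ungapped antiparallel pairing score over all window placements."""
    reversed_candidate = candidate[::-1]  # pair the strands antiparallel
    best = float("-inf")
    for i in range(len(mrna) - window + 1):
        for j in range(len(reversed_candidate) - window + 1):
            score = sum(
                PAIR_SCORES.get((mrna[i + k], reversed_candidate[j + k]), MISMATCH)
                for k in range(window)
            )
            best = max(best, score)
    return best

def first_phase_filter(mrna, candidates, threshold, window=8):
    """Keep non-coding regions that pair with the mRNA at or above threshold."""
    return [c for c in candidates
            if best_duplex_score(mrna, c, window) >= threshold]

# Toy sequences: the first candidate is the reverse complement of mrna[2:10].
mrna = "AUGGCGCAUCGAUU"
candidates = ["GAUGCGCC", "AAAAAAAA"]
print(first_phase_filter(mrna, candidates, threshold=16))
```

Candidates that survive this search would then be passed to the second-phase classifier, which removes regions that pair well but do not look like sRNAs.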
【Key words】 Proteomics; Bioinformatics; Outer membrane protein; Machine learning; Measure of diversity; Support vector machine; Hidden Markov model; Small non-coding RNA;