蛋白质序列数据的分类预测研究

The Study on Classification and Prediction for Biological Protein Sequential Data

【Author】 Liu Hui (刘惠)

【Advisor】 Yang Jie (杨杰)

【Author information】 Shanghai Jiao Tong University, Pattern Recognition and Intelligent Systems, 2007, PhD

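【Abstract (translated from Chinese)】 Sequential data is a special class of data in data mining and is found in many areas of society. How to mine the useful information contained in these large, complex sequence databases is a new research topic in data mining, with significant theoretical and practical value. This dissertation studies sequence classification using protein sequence data as the example, which also makes it a topic in bioinformatics. Centered on the classification and prediction of protein sequences, and drawing on a wide range of sequence analysis algorithms, the dissertation groups sequence feature analysis into two main families: methods based on feature extraction and methods based on similarity models. The research accordingly follows two routes. On the one hand, using feature extraction, features are extracted for membrane protein sequences and for signal peptide sequences according to the characteristics of each, and are then used for classification. On the other hand, using a similarity model, a similarity based on global (full-sequence) alignment is proposed to predict signal peptides. It is then embedded into a kernel space to improve prediction stability and to obtain explicit attribute vectors for the sequences, which unifies the two routes. The dissertation further applies linear dimensionality reduction to remove redundant and irrelevant dimensions and to enable visualization. Overall, the dissertation concentrates on the classification and prediction of protein sequences, with the following main contributions:

(1) Sequence features are extracted in a differentiated, purpose-driven way for each kind of sequence to form attribute vectors, which are used to train classifiers and predict new samples. For membrane protein sequences, which are relatively long, each sequence is first numerically encoded into a time series and treated as a discrete signal sampled at a sequence-dependent time interval. The sequence is then analyzed with digital signal processing theory, which avoids the loss of sequence-order information seen in earlier algorithms. The analysis shows that the magnitude and phase of the low-frequency components capture sequence features effectively and reduce the influence of noise. Experiments show that this frequency-domain feature extraction is effective for membrane protein sequences and benefits classification.

(2) For signal peptide sequences, which are relatively short, a sliding window truncates the variable-length sequences into fixed-length segments. Mutual information analysis reveals complex coupling between positions within a segment. Because existing algorithms define this coupling blindly, the dissertation proposes extracting rules with multiple decision trees and using them to identify signal peptides and their cleavage sites. Experiments show that this approach improves the prediction rate for both sequence segments and signal peptide cleavage sites.

(3) Taking similarity as the foundation of classification, a similarity based on global alignment is defined for predicting signal peptides, which avoids problems introduced by sliding windows, such as imbalanced samples. An analysis of its mathematical properties proves in detail that this similarity is a metric. Applied to signal peptide prediction, it achieves good prediction rates and robustness. The results indicate that the similarity does characterize the relationship between samples and provides a good representation for classification. The proposed algorithm is available as an online service over the Internet, which offers a fast and effective way to broaden its use.

(4) The handling of non-positive-definite (indefinite) kernels is investigated. Based on an analysis of the deviation between the global-alignment similarity and Euclidean distance, an indefinite kernel algorithm based on global alignment is proposed and applied to signal peptide classification. In addition, feature vectors for the sequence samples are extracted without sacrificing prediction rate, which reduces the problem back to feature-based pattern recognition. Experiments show that the algorithm extracts protein sequence features effectively and facilitates signal peptide prediction.

(5) For the "small sample size problem" in linear dimensionality reduction, a new method is proposed that makes full use of the null space of the within-class scatter matrix and effectively handles the instability caused by small eigenvalues. In signal peptide prediction, starting from the high-dimensional attribute vectors already obtained, the method removes a large number of redundant and irrelevant attributes, improves processing efficiency, and enables visualization, with satisfactory results.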
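As a rough illustration of the frequency-domain feature extraction described in point (1), the following Python sketch encodes a protein sequence numerically and keeps the magnitude and phase of its lowest-frequency DFT coefficients. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the abstract does not say which numerical encoding is used, so the Kyte-Doolittle hydropathy scale is assumed here, and the names `KD_SCALE`, `low_frequency_features`, and `n_coeffs` are hypothetical. Dividing the spectrum by the sequence length is one way to make coefficients from sequences of different lengths comparable, in the spirit of treating each sequence as a signal sampled at a different interval.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy values as the numerical encoding.
# The abstract does not name the encoding actually used.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def low_frequency_features(seq, n_coeffs=8):
    """Return magnitudes and phases of the first n_coeffs DFT coefficients.

    The output always has length 2 * n_coeffs, regardless of sequence length.
    """
    # Encode the residues as a real-valued discrete-time signal.
    signal = np.array([KD_SCALE[aa] for aa in seq], dtype=float)

    # Real-input FFT, normalised by length so that sequences of different
    # lengths yield comparable coefficients.
    coeffs = np.fft.rfft(signal) / len(signal)

    # Keep only the low-frequency part; zero-pad if the sequence is so short
    # that fewer than n_coeffs coefficients exist.
    low = np.zeros(n_coeffs, dtype=complex)
    k = min(n_coeffs, len(coeffs))
    low[:k] = coeffs[:k]

    # Feature vector: magnitudes first, then phases.
    return np.concatenate([np.abs(low), np.angle(low)])


if __name__ == "__main__":
    features = low_frequency_features("MKTLLLAVAVALLAGC", n_coeffs=4)
    print(features.shape)
```

The resulting fixed-length vector could be passed to any standard classifier; the choice of `n_coeffs` trades detail against noise suppression and would need to be tuned on real data.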

【Abstract】 Sequential data is a special kind of data in data mining and exists widely in diverse fields. How to extract or mine knowledge from large amounts of sequential data is a new research topic of both theoretical and practical importance. This dissertation studies classification and prediction for sequential data, focusing on biological protein sequences. We group the many existing analysis algorithms for sequential data into two kinds of methods: those based on feature extraction and those based on similarity. On the one hand, following the feature extraction approach, we extract different features for different kinds of sequences, namely membrane proteins and signal peptides. On the other hand, we propose a similarity based on global alignment for prediction, and then embed the similarity into a kernel space to improve stability. With these methods a feature vector can be obtained, so the similarity-based method is unified with the feature extraction method. Feature reduction is also studied, which allows the sequential data to be visualized. The innovative ideas in this dissertation are as follows:

(1) Building on traditional pattern recognition algorithms, we extract features tailored to each kind of sequence and then train a classifier to predict new samples. Membrane proteins are first encoded as discrete-time series sampled at different sampling intervals, and the series are then analyzed with digital signal processing theory. This avoids the loss of sequence-order information that affects other algorithms. In the frequency domain, we extract low-frequency features, both magnitude and phase, to represent the main information in the series and to reduce noise. The experiments illustrate the performance of low-frequency spectral feature extraction for predicting membrane protein types.

(2) For short sequences such as signal peptides, a sliding window is adopted to transform variable-length sequences into fixed-length segments. Mutual information reveals complex coupling effects between positions, which many earlier algorithms simplify blindly. Multiple decision trees are then proposed to extract statistical rules for predicting signal peptides and their cleavage sites. The experiments give promising results.

(3) Taking similarity as the foundation for classification, we define a similarity model based on global alignment, which avoids shortcomings of sliding-window methods such as the class imbalance problem. By analyzing its mathematical properties, the similarity is proved to be a metric. When applied to signal peptide prediction, the similarity achieves a stable, high prediction rate. The results demonstrate that the defined similarity represents the relationship between sequences well and provides a suitable representation for them. An online bioinformatics web server is also available to make the method accessible to biology researchers.

(4) We study the indefinite kernel. First, we analyze the difference between the similarity based on global alignment and the traditional Euclidean distance. We then propose an indefinite kernel algorithm and apply it to signal peptide prediction. In addition, a feature vector can be obtained while keeping a high prediction rate, so the similarity-based method is unified with the feature extraction method. The experiments confirm the performance.

(5) We study how to reduce the data dimension and extract useful features for classification. Making full use of the null space of the within-class scatter matrix, we propose Separated Space based Linear Discriminant Analysis (SSLDA), which avoids the instability of traditional LDA. For signal peptides, we apply SSLDA to the high-dimensional vectors obtained with the indefinite kernel based on global alignment similarity and obtain a reduced dimension. The sequential data can then also be visualized.
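To make the idea of a global-alignment-based similarity in point (3) concrete, here is a minimal Python sketch of a Needleman-Wunsch global alignment score, followed by a nearest-neighbour prediction that uses it. This is only an illustration under stated assumptions: the abstract does not give the dissertation's actual similarity definition or scoring scheme, so the simple match/mismatch/gap scores below are assumed, and the raw alignment score computed here is not claimed to be a metric. The names `global_alignment_score` and `predict_label`, and the example sequences, are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Assumption: a simple match/mismatch/gap scheme; a real protein application
    would more likely use a substitution matrix.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align both residues, gap in b, gap in a.
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def predict_label(query, labelled):
    """Return the label of the labelled sequence most similar to the query."""
    best = max(labelled, key=lambda item: global_alignment_score(query, item[0]))
    return best[1]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    training = [("MKKLLLAA", "signal"), ("DDEEGGSS", "non-signal")]
    print(global_alignment_score("MKKLLAA", "MKKLLLAA"))
    print(predict_label("MKKLLAA", training))
```

Because whole sequences are compared directly, no sliding window is needed, so each sequence contributes one sample rather than many overlapping segments.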
