Predicting Protein-Protein Interactions by Biostatistics Methods

【Author】 Hu Jia

【Advisor】 Li Tonghua

【Author Information】 Tongji University, Analytical Chemistry, 2007, Master's degree

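【Abstract (translated from the Chinese)】 Proteins are the principal material carriers of life's activities, and no life process takes place without them. Predicting protein function and mechanisms of action has become a very active topic in the life sciences. Many proteins carry out their biological functions through interactions with other proteins, and these interactions play a key role at the level of cell biology: first, genetic interactions are often related to interactions between the corresponding proteins; second, signal transduction pathways require protein interactions; third, enzyme-protein substrate interactions are closely tied to biological catalysis; finally, protein interactions are also crucial for the assembly of multicomponent enzymatic machinery such as RNA polymerase. Studying protein interactions and identifying the partners of a given protein is therefore of great importance for understanding protein function. This thesis first downloads protein interaction data from the DIP database and selects from it the positive set needed for the experiments, then constructs the negative set using the subcellular localization classification provided by the MIPS database. Working from primary-structure information alone, we first encode the protein sequences with the CTD encoding method from the literature to extract the statistical features contained in the sequences, then build and test models with the support vector machine (SVM) algorithm, obtaining an average accuracy above 79%. We then apply different strategies for variable selection; after optimizing the encoding, 5-fold cross-validation gives an accuracy of 82.43%, more than 5 percentage points above the cross-validation result reported in the literature (76.9%). Next, we adopt four further encoding methods that encode the sequences from different perspectives and extract variables, and combine them with SVM for prediction; all results are better than the literature value. The best of these, amino acid double encoding, reaches a 5-fold cross-validation accuracy of 85.91%, 9 percentage points above the literature value. Notably, among these four methods, amino acid single encoding, amino acid double encoding, and pseudo-amino-acid encoding had previously been used only for other biological recognition problems. The Gaussian function distribution encoding is a new method that we propose; it makes reasonable use of more of the useful information, performs comparably to amino acid double encoding, and also reaches an accuracy above 85%. Finally, we introduce a consensus model into protein interaction prediction: several member sub-models are built with different encoding methods and then combined in a two-layer SVM fusion network, which exploits the strengths of the different encoding ideas and the complementarity between models. This further improves predictive performance, with accuracy reaching as high as 86.80%, which to our knowledge is the best classification result reported internationally to date. This thesis is divided into four main parts: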
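To make the sequence-encoding step more concrete, the following is a minimal sketch of what amino acid single and double encoding might look like. It assumes that "single encoding" means amino acid composition (20 frequencies) and "double encoding" means dipeptide composition (400 frequencies), and that the two proteins of a pair are represented by concatenating their vectors. The abstract confirms none of these details, so the function names and the pairing scheme are illustrative assumptions, not the thesis's implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(seq):
    # Simplification (assumption): non-standard symbols are dropped before counting.
    return "".join(aa for aa in seq.upper() if aa in AMINO_ACIDS)


def single_encoding(seq):
    """Assumed 'single encoding': relative frequency of each of the 20 residues."""
    seq = _clean(seq)
    n = len(seq)
    return [seq.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def double_encoding(seq):
    """Assumed 'double encoding': relative frequency of each adjacent residue pair."""
    seq = _clean(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    n = len(pairs)
    return [
        counts.get(a + b, 0) / n if n else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


def encode_pair(seq_a, seq_b, encoder=double_encoding):
    """Hypothetical pair representation: concatenate the two proteins' vectors."""
    return encoder(seq_a) + encoder(seq_b)
```

With this layout, `encode_pair` yields 800 features per protein pair under double encoding and 40 under single encoding; such vectors could then be passed to any SVM implementation.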

【Abstract】 Proteins are the primary components of the cellular machinery, and no organism can function without them. Predicting the function and mechanism of proteins is now one of the most important topics in the life sciences. Many proteins mediate their biological function through protein interactions, and protein interactions are crucial for many aspects of cellular biology. First, genetic interactions often correlate with physical interactions between the corresponding gene products. Second, protein interactions are required to physically tether the components of signal-transduction pathways. Third, enzyme-protein substrate interactions are important for catalysis and are often found to be more stable than presumed. Finally, protein interactions are crucial for the integrity of multicomponent enzymatic machines such as RNA polymerases and the spliceosome. Computational prediction of protein interactions has therefore been pursued on the assumption that identifying interaction partners for proteins of unknown function can provide insight into their biological function. In this work, the positive dataset is downloaded from the Saccharomyces cerevisiae core subset of the DIP database. Since a noninteracting protein dataset is not readily available, a hypothetical noninteracting protein dataset is generated from subcellular localization information retrieved from the MIPS database; it consists of protein pairs that do not colocalize. First, using only the amino acid sequence, each protein sequence is converted into a feature vector with the CTD encoding approach. A set of SVMs was trained to predict protein interactions, and the prediction accuracy averaged 79% over the ensemble of statistical experiments. After optimizing the set of parameter vectors by different strategies, the predictive accuracy obtained through 5-fold cross-validation is 82.43%, about 5 percentage points higher than the literature. We then predict protein interactions with four other encoding approaches. All the results are better than the literature. The predictive
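The construction of the hypothetical noninteracting set described in the abstract (protein pairs that do not colocalize) could be sketched roughly as follows. This is an illustration under stated assumptions: the input format, the function name `build_negative_set`, the exclusion of unannotated proteins, and the random sampling are all hypothetical choices, and the protein identifiers and compartment labels in the usage example are made up.

```python
import random
from itertools import combinations


def build_negative_set(localization, positives, n_pairs, seed=0):
    """Sample protein pairs that share no subcellular compartment.

    localization: dict mapping protein id -> set of compartment labels
    positives:    iterable of (id_a, id_b) known interacting pairs
    n_pairs:      maximum number of negative pairs to return
    """
    known = {frozenset(pair) for pair in positives}
    candidates = [
        (a, b)
        for a, b in combinations(sorted(localization), 2)
        # Assumption: proteins without any annotation are skipped,
        # since "no annotation" is not evidence of non-colocalization.
        if localization[a] and localization[b]
        and not (localization[a] & localization[b])  # no shared compartment
        and frozenset((a, b)) not in known           # never reuse a positive pair
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(candidates)
    return candidates[:n_pairs]


# Usage with made-up identifiers and compartments:
example_localization = {
    "P1": {"nucleus"},
    "P2": {"mitochondrion"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": set(),
}
example_negatives = build_negative_set(
    example_localization, positives=[("P1", "P3")], n_pairs=10
)
```

In practice the negative set would typically be sampled to match the size of the positive set so that the classifier sees balanced classes; whether the thesis did so is not stated in the abstract.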

  • 【Online publication contributor】 Tongji University
  • 【Online publication issue】 2007, Issue 01
  • 【Classification code】 Q51
  • 【Times cited】 3
  • 【Downloads】 792