Analysis and Prediction of Beta-sheet Structures in Proteins and Bioinformatics Software Tools Developing
【Author】 Zhang Ning (张宁);
【Supervisor】 Zhang Tao (张涛);
【Author information】 Nankai University (南开大学), Bioinformatics, 2010, Ph.D. dissertation
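【Abstract (translated from Chinese)】 The β-sheet is an important type of protein secondary structure and one of the main factors limiting the accuracy of protein structure prediction. In-depth study and accurate prediction of β-sheet structure can substantially improve the accuracy of protein structure prediction and would advance research on protein folding and protein design. This dissertation therefore focuses on the β-sheet structure. The study uses a dataset taken from the PISCES server. During data preprocessing, we revised and extended the SheetsPair database built in our earlier work and integrated the PISCES dataset into it; all subsequent work manages its data through this database. The study proceeds from interstrand amino acid pairing to the pairing of β-strand peptide segments. We first carried out a statistical analysis of interstrand amino acid pairs. The results show that interstrand pairing is not random but exhibits, overall, a clear pairing propensity. From these statistics we derived relative frequency matrices that reflect amino acid pairing preferences for parallel sheets, antiparallel sheets, and β-sheets overall; these matrices form the basis of our later work. The analysis also found that hydrophobic interactions and disulfide bonds are the two main factors influencing amino acid pairing, and that other factors (such as the surrounding environment) may also contribute. Pairing preferences also differ between parallel and antiparallel sheets. We then analyzed amino acid pairing preferences using metric multidimensional scaling (MMDS), which presents the main features of amino acid pairing captured in the relative frequency matrices in an intuitive graphical form. The MMDS maps for parallel, antiparallel, and overall β-sheets each show a distinct "core" cluster of amino acids, composed mainly of strongly hydrophobic residues, which underlines the importance of hydrophobic interactions in β-sheet structure. The MMDS analysis also revealed differences in pairing affinity between parallel and antiparallel sheets, laying a foundation for future algorithms that distinguish parallel from antiparallel sheets. Based on the MMDS results combined with hierarchical clustering, we also propose a reduced grouping of the 20 amino acids: five groups is optimal overall, six when parallel sheets are considered alone, and four when antiparallel sheets are considered alone. Building on the analysis of interstrand amino acid pairs, we next examine the pairing and arrangement of β-strand peptide segments. Intuitively, β-strand arrangement involves at least three aspects: (1) determining pairing partners, i.e. which β-strands of a sheet pair with which; (2) predicting the relative orientation (parallel or antiparallel) of two paired β-strands; (3) determining the relative position of two paired β-strands. Our work addresses each of these in turn. We concentrated first on aspect (2), the relative orientation of paired β-strands. Using the relative frequency matrices obtained earlier, we analyzed the relationship between amino acid pairing and strand orientation. The results show that interstrand amino acid pairing correlates very significantly with parallel/antiparallel orientation: interstrand amino acid interactions play an important, even decisive, role in determining the orientation, whereas environmental and other uncertain factors have a smaller influence. Starting from this conclusion, we adopted a new encoding scheme and developed a support vector machine (SVM) based method for predicting the parallel/antiparallel orientation of β-sheets. The method achieves a relatively high prediction accuracy (86.89% accuracy and a Matthews correlation coefficient of 0.7126). For aspect (1), a preliminary study of strand pairing partners found that β-strands tend to pair with nearby strands (a "first come, first paired" tendency). In antiparallel sheets, the pairing of adjacent β-strands also shows a fairly strong preference with respect to amino acid distance, whereas in parallel sheets this preference is weaker. For aspect (3), we found that when the β-strands of a sheet pair up, their ends do not necessarily align with each other, and "extended ends" often occur. Statistical analysis of the extended ends shows that the paired part generally accounts for more than 25% of the extended length (the length of the paired part plus the lengths of the extended ends on both sides) and for more than 40% of the β-strand length. Drawing on the bioinformatics experience accumulated during this research, we developed several software tools that can facilitate many kinds of bioinformatics research, including work on β-sheets. The main tools are: StrandPairsViewer, for visualizing interstrand amino acid pairs in β-sheets; SRD, for dynamic plotting and visual analysis of sequence relationships among biological macromolecules; the NRChart control (an ActiveX control), for reading and displaying time-series data; PCDReader, for preprocessing patch-clamp data; LTPConverter, for text conversion of long-term potentiation (LTP) experiment data; and Super Notepad, for general plain-text processing in everyday bioinformatics work. Considerable effort went into optimizing the performance of many of these tools (faster execution, lower memory use). The dissertation describes their features, main functions, and the main programming techniques and methods used.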
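The abstract does not give the formula behind the relative frequency matrices, so the following is only an illustrative sketch of one common way to build such a matrix: the observed share of each unordered residue pair divided by the share expected if the two residues were drawn independently from the composition of all paired positions. The function name `relative_frequency`, the observed/expected definition, and the toy pair list are assumptions for illustration, not the dissertation's method or data.

```python
from collections import Counter
from itertools import combinations_with_replacement


def relative_frequency(pairs):
    """Return {(a, b): observed/expected} for unordered residue pairs.

    ASSUMPTION: "expected" treats the two residues of a pair as drawn
    independently from the residue composition of all paired positions.
    The dissertation may define its relative frequency differently.
    """
    # Count unordered pairs, so ("V", "I") and ("I", "V") are the same pair.
    pair_counts = Counter(tuple(sorted(p)) for p in pairs)
    total_pairs = sum(pair_counts.values())

    # Residue composition over all paired positions (two per pair).
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = 2 * total_pairs

    rf = {}
    for a, b in combinations_with_replacement(sorted(residue_counts), 2):
        p_a = residue_counts[a] / total_residues
        p_b = residue_counts[b] / total_residues
        # An unordered pair of two different residues can arise in two orders.
        expected = p_a * p_b * (1 if a == b else 2)
        observed = pair_counts.get((a, b), 0) / total_pairs
        rf[(a, b)] = observed / expected
    return rf


# Toy input (hypothetical pairs, not taken from the PISCES dataset).
toy_pairs = [("V", "I"), ("V", "V"), ("C", "C"), ("I", "V")]
for pair, value in sorted(relative_frequency(toy_pairs).items()):
    print(pair, round(value, 3))
```

Under this definition, a value above 1 marks a pair seen more often than chance would suggest and a value below 1 marks a pair seen less often. Computing the matrix separately for parallel and antiparallel strand pairs would give the kind of per-orientation comparison the abstract describes.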
【Abstract】 The β-sheet is one of the most important protein secondary structures and has remained one of the main stumbling blocks in protein structure prediction. An in-depth study and accurate prediction of β-sheets may lead to noticeable improvements in de novo protein structure prediction and in the study of protein folding and design. In this study, we mainly explored the β-sheet structure. The dataset used was taken from the PISCES server. Building on our previously constructed SheetsPair database, we preprocessed all proteins in the PISCES dataset and integrated them into the database, which was then used to manage all protein data in our further studies. We pursued a research strategy that moves from interstrand amino acid pairs to β-strand (peptide segment) arrangement. First, a statistical analysis of the amino acid pairs revealed non-random pairing propensities. Based on the statistical results, three relative frequency (RF) matrices were obtained for parallel, antiparallel, and total β-strands, respectively. These matrices were then used widely in our further studies. Hydrophobic interactions and disulfide bonds were shown to be the two main factors influencing interstrand amino acid pairing. Other factors (such as the surroundings) also appeared to contribute to the pairing. Furthermore, the analysis revealed notable differences in amino acid pairing preferences between parallel and antiparallel β-strands. We then analyzed the amino acid pairing preferences using metric multi-dimensional scaling (MMDS). The MMDS method was used to give a visual representation of the RF matrices, which represent the interactions between amino acids. The MMDS maps for parallel, antiparallel, and total β-strands each showed a distinct "core" made up mainly of strongly hydrophobic amino acids.
This again indicated the importance of hydrophobic interactions in amino acid pairing. Another finding was that the MMDS maps for parallel and antiparallel β-strands differed, which could be used in further work to develop methods for predicting parallel and antiparallel orientation. We also applied a hierarchical clustering method to the MMDS results to group the 20 amino acids. This gave an optimum of 5 groups for total β-strands, 6 for parallel, and 4 for antiparallel. Building on the analysis of amino acid pairs above, we then investigated β-strand (peptide segment) arrangement. At the most straightforward level, full β-strand arrangement consists of: (i) finding the interacting partner β-strand(s), (ii) predicting the relative orientation (i.e. parallel or antiparallel), and (iii) determining the relative positions of the two interacting β-strands. Our further studies were organized around these three aspects. We first focused on the second aspect, i.e. the parallel or antiparallel orientation. By extracting features from the RF matrices, we found that interstrand amino acid pairs played a significant role in determining the parallel or antiparallel orientation of β-strands, and that the influence of the surroundings and other uncertain factors was small in this respect. From these conclusions, we proposed a new encoding scheme and developed a support vector machine-based approach for predicting the parallel/antiparallel orientation of β-strands. It achieved a prediction accuracy of 86.89% and a Matthews correlation coefficient of 0.7126. For the first aspect, we performed a preliminary study of the strand partner distribution. The results showed that most β-strands tended to pair with their nearest neighbouring strands (a "First Come First Pair" rule).
Furthermore, nearest-neighbour paired β-strands showed a stronger preference with respect to amino acid distance in antiparallel sheets than in parallel sheets. For the third aspect, we found that when β-strands are arranged to form a β-sheet, the ends of one β-strand do not necessarily align with the ends of its partner; instead, part of a strand often extends beyond the paired region. Statistical results showed that the ratio of the length of the paired part to the extended length (the length of the paired part plus the lengths of the two extending parts) was generally more than 25%, and the ratio of the length of the paired part to the length of the β-strand was generally more than 40%. Drawing on the experience and techniques accumulated during this research, we developed several software tools and utilities to facilitate future studies of β-strands and of other areas of bioinformatics. They are as follows: StrandPairsViewer, for visualizing interstrand amino acid pairs; SRD, for visualizing DNA/protein sequence relationships based on undirected graphs; the NRChart control (an ActiveX control), for reading and visualizing time series data; LTPConverter, for converting long-term potentiation (LTP) experiment data; Super Notepad, for ASCII text processing in daily bioinformatics research; etc. Much effort was devoted to making these tools run faster and use less memory. Their features, applications, programming methods, and techniques are presented in the dissertation.
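The two ratios defined above for the "extended ends" can be worked through on a small example. The sketch below assumes each strand is described by an inclusive (start, end) interval on a shared alignment axis, so that the paired part is the overlap of the two intervals and the extended length is their combined span. That coordinate convention, the function name `pairing_ratios`, and the example numbers are illustrative assumptions, not the dissertation's data model.

```python
def pairing_ratios(strand_a, strand_b):
    """Compute the paired-part ratios for two paired beta-strands.

    ASSUMPTION: strand_a and strand_b are inclusive (start, end) positions
    on a shared, hypothetical alignment axis, so paired residues of the two
    strands sit at the same coordinate.
    """
    a0, a1 = strand_a
    b0, b1 = strand_b

    # Paired part: positions covered by both strands.
    paired = min(a1, b1) - max(a0, b0) + 1
    if paired <= 0:
        raise ValueError("strands do not overlap, so there is no paired part")

    # Extended length: paired part plus the extending part at each end,
    # i.e. the full span covered by either strand.
    extended = max(a1, b1) - min(a0, b0) + 1

    return {
        "paired_to_extended": paired / extended,
        "paired_to_strand_a": paired / (a1 - a0 + 1),
        "paired_to_strand_b": paired / (b1 - b0 + 1),
    }


# Hypothetical pair: a 6-residue strand and a 7-residue strand sharing 3 positions.
ratios = pairing_ratios((1, 6), (4, 10))
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the paired part is 3 positions long and the extended length is 10, giving a paired-to-extended ratio of 0.30, while the paired part covers 50% of the first strand and about 43% of the second. All three values happen to lie above the 25% and 40% levels that the abstract reports as typical.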
【Key words】 protein; beta-sheet structure; amino acid pairs; beta-strand arrangement; multi-dimensional scaling; support vector machine; database; software development;