Bioinformatics Studies on the Relationship between Disulfide Structural Feature and Sequence in Proteins

【Author】 Song Jiangning

【Advisor】 Xu Wenbo

【Author information】 Jiangnan University, Fermentation Engineering, 2005, PhD

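【Summary】 A disulfide bond is a covalent bond formed by the pairing of two cysteine residues in a protein; it can occur within a single polypeptide chain or between different polypeptide chains. For many proteins, disulfide bonds are a permanent feature of the final folded product. Disulfide bond formation is an important step in protein folding: its kinetics influence the rate and pathway of folding, and mispairing is a major cause of incorrect folding of polypeptide chains. Disulfide bonds are crucial for maintaining the stability of a protein's spatial structure and preserving its physiological activity. Studying the sequence and structural features of disulfide-bonded proteins, and identifying structural information associated with disulfide bond formation, is therefore of real value for both protein engineering and rational drug design. This dissertation takes protein disulfide bonds as its subject. Using the methods and tools of bioinformatics, an emerging interdisciplinary field, and drawing on mathematics, physics, biology and computer science, it constructs two databases (a high-accuracy, high-reliability database of protein disulfide spatial structures, and a database linking Escherichia coli protein disulfide bonds to their corresponding gene sequences) and studies, systematically and in depth, the relationship between the structural features of disulfide bond formation and sequence at three levels: gene sequence, amino acid sequence and three-dimensional structure. The main contents are as follows:

(1) Construction of a high-quality protein disulfide spatial structure database is the basis for statistical and computational analysis of protein disulfide bonds. High-accuracy, high-reliability protein structures were selected from the PISCES Culled PDB database using the criteria of resolution below 0.25 nm and sequence identity below 30%. From these, the PDB structures containing SSBOND records were chosen and subjected to strict checks for file format errors, sequence self-consistency and SSBOND record accuracy, followed by correction of the SSBOND bonding records. Erroneous and doubtful data were deleted, yielding a high-quality protein disulfide spatial structure database. Studying the relationship between protein folding and protein coding sequence requires high-quality protein structures together with their corresponding gene sequences. By querying the E. coli proteins in the SWISS-PROT database, a cross-reference table of protein structures and gene sequences across different databases was obtained. After deleting a large amount of redundant and unreliable data, a high-accuracy dataset of E. coli protein structures and corresponding gene sequences, EcoPDB, was obtained; it is the base dataset for studying the correspondence between protein spatial structure data and nucleic acid sequence data.

(2) Studying the formation features and sequence structural features of protein disulfide bonds is important for further work on disulfide bond formation, its relationship to amino acid sequence, the prediction of the disulfide-bonding state of cysteines, and disulfide-assisted folding dynamics. The redox state of CYS shows clear cooperativity: if a protein sequence forms disulfide bonds, all or most of the CYS residues in that sequence tend to be oxidized. Disulfide bonds are distributed very unevenly along protein sequences: most form at sequence distances of fewer than 70 amino acids, and there are strongly preferred sequence distances, such as 11, 6, 16, 5 and 13 amino acids. Disulfide bonds occur preferentially in the first half of the amino acid sequence, which during translation helps ensure smooth elongation of the nascent peptide chain and reduces misfolding. Comparing the amino acid distributions around oxidized and reduced CYS shows fairly clear differences between the two: around the former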

【Abstract】 Disulfide bonds are primary covalent crosslinks between cysteine side chains that can exist either within a single polypeptide chain or between different polypeptide chains. For many proteins, disulfide bonds are a permanent feature of the final folded product. The correct formation of disulfide bonds is a crucial step in the folding pathway, and the kinetics of disulfide formation can dominate the rate and pathway of protein folding. Mispairing of disulfide bonds is an important cause of incorrect folding of polypeptide chains. Such bonds play important roles in stabilizing protein spatial conformation and in ensuring that a protein can perform its biochemical function.

Systematic bioinformatics studies on the relationship between disulfide structural features and sequences in proteins have potentially important applications in both protein engineering and rational molecular drug design, such as introducing engineered disulfide bonds to increase the conformational stability of proteins and helping to locate disulfide bridges to aid three-dimensional structure prediction. In this dissertation, disulfide bonds in proteins were taken as the research subject, using common algorithms and tools in bioinformatics together with knowledge from mathematics, physics, biology and computer science. After constructing a high-quality protein disulfide structural database and a high-quality database of Escherichia coli gene sequences and corresponding protein structures, systematic bioinformatics studies were carried out to explore the relationship between the structural features of disulfide bond formation and sequences in proteins at three levels: the gene coding sequence, the amino acid sequence and the three-dimensional spatial structure.
The main contents of this dissertation are as follows:

(1) Construction of a high-quality protein disulfide structural database was the basis for the statistical analysis and computation of disulfide bonds in proteins. Protein structures with a resolution better than 0.25 nm (that is, numerically smaller than 0.25 nm) and sequence identity below 30% were selected from the PISCES Culled PDB to form the raw dataset. From this dataset, a large, high-quality disulfide bond database was built after strict checks of structural data file format, sequence consistency and SSBOND record accuracy, followed by correction of the SSBOND records and elimination of inaccurate and questionable data. A high-quality database of Escherichia coli gene sequences and corresponding protein structures is essential for investigating the relationship between protein folding and protein coding sequence. By querying Escherichia coli proteins in SWISS-PROT, a cross-reference table of protein structures and their corresponding gene sequences in different databases was obtained. After removing a large amount of redundant and unreliable data, a high-quality dataset, EcoPDB, was constructed; it is a fundamental dataset for understanding the relationship between protein spatial structure data and nucleic acid sequence data.

(2) The formation features and sequence distribution features of disulfide bonds in proteins are important for further investigation of disulfide bond formation, the relationship between disulfide bonds and amino acid sequences, the prediction of the disulfide-bonding states of cysteines, and folding dynamics assisted by disulfide bonds.
The results indicated that the oxidation states of cysteines showed obvious cooperativity: almost all cysteines in a protein were oxidized if that protein contained disulfide bonds. The distribution of disulfide bonds along protein sequences was rather uneven: most bonds formed between two cysteines separated by a sequence distance of less than 70 residues. There was also a strong preference for certain sequence distances, such as 11, 6, 16, 5 and 13. Disulfide bonds tended to form in the first half of the amino acid sequence, which is beneficial for ensuring smooth elongation of the nascent polypeptide chain and for reducing misfolding during protein folding. The amino acid distributions flanking oxidized and reduced cysteines differed distinctly: around oxidized cysteines, hydrophobic and polar residues occurred more frequently, whereas around reduced cysteines, strongly hydrophobic and charged residues were more frequent. Residues at different positions flanking an oxidized cysteine contributed differently to disulfide bond formation: certain residues at certain positions strongly favored disulfide bond formation, while other residues at certain positions strongly disfavored it.

(3) A novel approach was introduced to predict the disulfide-bonding states of cysteines in proteins by means of a two-class linear discriminator based on amino acid and dipeptide composition. The results demonstrated that the cooperativity exhibited by cysteine oxidation could be well described by the compositions of the 20 amino acids and 400 dipeptides in proteins.
Based on the composition of the 20 amino acids, the prediction accuracy for the oxidation state of cysteines reached 85.2% on a per-cysteine basis and 81.2% on a per-protein basis under the rigorous jack-knife procedure. The accuracies for oxidized and reduced cysteines were Qoxi = 89.9% and Qred = 71.0%, respectively, and the Matthews correlation coefficient (MCC) was 60.6%. Based on the 400 dipeptide compositions, the classifier achieved Q2 = 89.1% at the cysteine level and Q2prot = 85.2% at the protein level, again evaluated by the rigorous jack-knife test. The accuracies for oxidized and reduced cysteines were Qoxi = 92.2% and Qred = 79.3%, respectively, and the MCC was 70.7%. These results showed that whether cysteines form disulfide bonds depends not only on the global structural features of proteins but also on the local sequence environment. The results also demonstrated that this method, based on amino acid and dipeptide compositions, provides prediction performance comparable to that of existing methods for predicting the oxidation states of cysteines in proteins.

(4) A novel approach was proposed to predict the disulfide-bonding states of cysteines in proteins by constructing a two-stage classifier that combines a first-stage global linear discriminator based on amino acid composition with a second-stage local support vector machine classifier. The results indicated that the new hybrid classifier had relatively higher prediction accuracy for the disulfide-bonding states of cysteines. With Qc = -0.1, the overall prediction accuracy could be improved to Q2 = 84.1% at the cysteine level and Q2prot = 80.1% at the protein level under the jack-knife procedure. The accuracies for oxidized and reduced cysteines were Qoxi = 87.8% and Qred = 77.8%, respectively, and the MCC was 62.2%.
This finding indicated that the formation of disulfide bonds by cysteines is determined by the global structural features of proteins as well as by the local sequence environment of the cysteines.

(5) The correlation between cysteine synonymous codon usage and the flanking amino acid residues, and the correlation between cysteine synonymous codon usage and disulfide bond formation, were investigated across the whole E. coli genome using a novel method based on information theory and statistical learning theory. By computing the I_m(cys|a) values of the twenty amino acid residues flanking cysteines on both the N-terminal and C-terminal sides in E. coli genome sequences, it was found that lysine at position -7, tryptophan at position -6, tryptophan at position -1, and methionine and glutamic acid at position +1 had a strong influence on cysteine synonymous codon usage. By computing the Shannon entropy of cysteine synonymous codons in EcoPDB, the high-quality database of E. coli gene sequences and corresponding protein structures, it was found that cysteine synonymous codons do contain some factors that influence disulfide bond formation. For the E. coli genome, the correlation between cysteine synonymous codon usage and disulfide bond formation may be a form of regulation of protein structure at the gene sequence level. The bias in cysteine synonymous codon usage should be regarded as a reflection of the biological functional constraints arising from disulfide bond formation.

(6) A method was developed for the classification and prediction of protein spatial structures and for protein structure homology search based on disulfide bonding patterns. Disulfide bonding patterns can be used to determine the structural classification of a target protein and to search for related proteins with similar disulfide information.
By computing the disulfide bonding patterns in the protein disulfide structural database and analyzing the correlation between protein disulfide patterns and related homologous structures, six detailed cases were presented to demonstrate that disulfide bonding patterns alone, rather than complete amino acid sequences, can be used to discriminate and classify protein structural folds. The results also indicated that proteins with the same disulfide bonding pattern usually belong to the same structural family or superfamily in the Structural Classification of Proteins (SCOP) database and commonly have similar biological functions.
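
As a rough illustration of the kind of processing described in items (1) and (2), the following Python sketch reads SSBOND records from PDB-format text and tallies the sequence separation of intra-chain disulfide bonds. It is not the dissertation's pipeline, which is not described in this abstract. The names `DisulfideBond`, `make_ssbond_line`, `parse_ssbond` and `sequence_separations` are hypothetical, and the sketch assumes that the difference in residue numbers is an acceptable proxy for sequence distance (it ignores insertion codes and numbering gaps) and that the SSBOND records are already correct.

```python
from collections import Counter
from typing import List, NamedTuple


class DisulfideBond(NamedTuple):
    chain1: str
    resseq1: int
    chain2: str
    resseq2: int


def make_ssbond_line(serial: int, chain1: str, res1: int, chain2: str, res2: int) -> str:
    """Build a minimal SSBOND record (illustrative helper for sample data only)."""
    # Columns (1-based): 1-6 "SSBOND", 8-10 serial, 12-14 "CYS", 16 chain,
    # 18-21 residue number, 26-28 "CYS", 30 chain, 32-35 residue number.
    left = f"SSBOND {serial:>3} CYS {chain1} {res1:>4}"
    right = f"CYS {chain2} {res2:>4}"
    return left + " " * 4 + right


def parse_ssbond(pdb_text: str) -> List[DisulfideBond]:
    """Extract disulfide bonds from the SSBOND records of PDB-format text."""
    bonds = []
    for line in pdb_text.splitlines():
        if not line.startswith("SSBOND"):
            continue
        # Fixed-column slices matching the column layout noted above.
        bonds.append(
            DisulfideBond(line[15], int(line[17:21]), line[29], int(line[31:35]))
        )
    return bonds


def sequence_separations(bonds: List[DisulfideBond]) -> Counter:
    """Count intra-chain bonds by residue-number gap (assumed proxy for sequence distance)."""
    return Counter(
        abs(b.resseq2 - b.resseq1) for b in bonds if b.chain1 == b.chain2
    )


# Made-up sample records: two intra-chain bonds and one inter-chain bond.
sample = "\n".join(
    [
        make_ssbond_line(1, "A", 6, "A", 127),
        make_ssbond_line(2, "A", 30, "A", 115),
        make_ssbond_line(3, "A", 64, "B", 80),
    ]
)
sample_bonds = parse_ssbond(sample)
print(sequence_separations(sample_bonds))
```

Applied to a whole structure set, a tally like this is the sort of histogram from which preferred sequence distances could be read off. Inter-chain bonds are skipped here because a residue-number difference across chains has no sequence-distance meaning.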

  • 【Online publication contributor】 Jiangnan University
  • 【Online publication issue】 2006, No. 09
  • 【Classification number】 Q51; Q811.4
  • 【Citations】 2
  • 【Downloads】 1123