Analysis and Prediction of RNA-binding Residues in Protein Molecules
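【Author】 查磊 (Zha Lei);
【Author information】 Academy of Military Medical Sciences of the Chinese People's Liberation Army, Biochemistry and Molecular Biology, 2012, PhD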
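【Abstract (translated from the Chinese)】 Protein-RNA interactions are widespread in RNA splicing, translation, viral replication and other biological processes in the cell. Investigating these interactions and identifying the amino acid residues in a protein that bind RNA is therefore important for understanding the mechanism of protein-RNA interaction. The problem is currently studied from two directions: experiment and bioinformatics. Experimentally, the three-dimensional structure of a protein-RNA complex is obtained mainly by X-ray crystallography, nuclear magnetic resonance and similar methods, and the residues that interact with RNA are then identified from that structure. The advantage of experiment is that the results are reliable; the disadvantage is the considerable cost in time and funding, together with a number of practical difficulties. For example, crystals of some protein-RNA complexes are very hard to obtain.

As protein structure data accumulated, researchers began to study the problem from a bioinformatics perspective. The approaches fall into three groups: identification through RNA-binding domains, through molecular dynamics simulation, and through statistical analysis or machine learning. In the domain approach, protein structure databases such as SCOP are searched to locate the RNA-binding domain of a protein, which roughly determines where the protein binds RNA. Its limitation is that it applies only to proteins whose RNA-binding domain has already been determined. Moreover, the mechanism of action of RNA-binding domains has not been fully elucidated: residues in the domain may bind not to the target RNA region but to other regions, or may even cause the protein to bind to another protein. Molecular dynamics simulation allows the binding process between protein and RNA, and the accompanying changes in energy and conformation, to be observed fairly directly. Its drawbacks are that simulation is time-consuming and suitable only for small systems, and that the settings of the various parameters affect the correctness of the result.

At present, the most suitable bioinformatics route is to extract a variety of features and build a discriminative model with machine learning. The growth in three-dimensional structure data for protein-RNA complexes in recent years, particularly since 2005, has provided the data foundation for this approach, and researchers have gradually begun related work on these data. Earlier studies, however, have the following shortcomings: ① the datasets are small, with some work based on little more than ten cases, so the conclusions may be biased; ② few features are considered, with some work looking at only one feature or one kind of feature; ③ some features must be obtained from three-dimensional structure data, or are physicochemical features that are complex and difficult to compute, which limits their application.

Given these shortcomings, we believe that a prediction model for RNA-binding residues with the following properties is still lacking: ① the model is built on large-scale data, to avoid potentially biased conclusions; ② a wide range of features is considered when building the model, to improve classification accuracy; ③ the features the model depends on can be obtained or computed from the protein sequence, so that the model is usable in practice.

To this end, we first extracted from the PDB database all complexes deposited up to June 2011 that were determined by X-ray crystallography, had a resolution greater than 3 Å and contained only protein and RNA, 532 in total. After removing 90 complexes whose nucleic acid sequences were too short (length of 4 or less) or contained errors, 442 complexes remained, containing 1970 protein sequences and 823 RNA sequences. Because many of the proteins are highly similar, and to avoid data redundancy, we clustered the protein sequences with the BLASTClust program at a threshold of no more than 25% sequence similarity. BLASTClust grouped them into 429 clusters. For each cluster we took the first sequence as the representative, which gave 429 protein sequences containing 90735 amino acid residues in total.

We used a distance-based definition on the PDB structure data to determine which residues interact with RNA: a residue in a protein-RNA complex is considered to interact with RNA if at least one of its atoms lies less than 3.5 Å from any atom of an RNA base. By this criterion, 10525 of the 90735 residues were labelled as RNA-binding sites and the remaining 80210 as non-binding sites.

For each residue we then extracted 150 features in nine classes: ① the number of atoms in the amino acid; ② its electrostatic charge; ③ its number of hydrogen bonds; ④ side-chain pKa value; ⑤ hydrophobicity; ⑥ relative solvent-accessible surface area; ⑦ secondary structure conformation; ⑧ a modified PSSM; ⑨ a classification of amino acids based on dipole moment and side-chain volume. Using TClass, a classification program we wrote, we carried out automatic feature selection and model construction based on the Naïve Bayes method and a forward feature selection strategy, with leave-one-out cross-validation accuracy as the criterion for model performance. We also applied attribute bagging for ensemble learning to improve performance further. On an independent test set the model achieved a classification accuracy of 83.46%, a specificity of 84.33% and a sensitivity of 78.55%. We also applied the model to the Xlrbpa protein; the predicted binding sites overlapped well with its experimentally determined RNA-binding sites, which confirms that the work is effective and applicable in practice.

By analyzing the relationship between feature values and RNA-binding preference, we reached the following conclusions: ① RNA shows a strong preference among amino acids: the most favoured amino acid (Arg) and the least favoured (Cys) differ 38-fold in number of occurrences; ② hydrophilic amino acids (e.g. Arg, Lys) are more favoured than hydrophobic ones (e.g. Cys, Met), with a 4.38-fold difference in number of occurrences; ③ in terms of R-group polarity and electrostatic charge, polar positively charged amino acids (e.g. Arg, Lys) are the most favoured and non-polar amino acids (e.g. Trp, Met, Phe) the least; ④ in the classification based on dipole moment and side-chain volume, amino acids with a dipole moment greater than 3.0 debye and a side-chain volume greater than 50 Å³ (e.g. Asp, Lys) are favoured, whereas those with a dipole moment greater than 3.0 debye but the opposite orientation (e.g. Asp, Glu) are not.

For the convenience of experimental researchers, we used the classification model to develop the online prediction server RBRPre with MATLAB Builder JA, MySQL and JSP. Users only need to submit a protein sequence to receive by email the RNA-binding residues on that sequence. Prediction tasks are queued and scheduled with the crontab timing mechanism and a MySQL database, which avoids congestion under highly concurrent prediction requests.

In summary, this study collected protein-RNA interaction data on a large scale, considered the relevant features comprehensively, built a model for predicting RNA-binding residues in proteins, and analyzed the relationship between feature values and the RNA-binding preference of amino acids. Tests on an independent test set and the case study show that the model performs well. Based on the model we also developed the online prediction server RBRPre. The work achieved the following results: ① researchers can obtain reasonably reliable information on protein-RNA interaction sites knowing only the protein sequence; ② the analysis of how amino acid features affect RNA-binding preference provides useful information for research into the mechanism of protein-RNA interaction; ③ the online prediction server RBRPre gives researchers bioinformatics support for finding RNA-binding sites in proteins and speeds up experimental work.

The distinctive features and innovations of this work are: ① the model allows researchers to obtain reasonably reliable RNA-binding sites knowing only the protein sequence, which makes it highly practical, and the online prediction website RBRPre offers a convenient and fast prediction service; ② the dataset is large and the feature coverage comprehensive, which avoids the potentially biased conclusions drawn from small datasets, and the independent test set and case study indicate that the predictions are reliable; ③ the RNA-binding preferences of the 20 amino acids were determined, together with the influence of different amino acid features on those preferences.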
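The distance-based definition of binding sites in this abstract can be illustrated with a minimal sketch. It is not the author's code: the function name `binding_residues`, the residue identifiers and the toy coordinates are hypothetical, the cutoff unit is assumed to be ångströms, and RNA atoms are passed as a flat list of coordinates rather than parsed from a PDB file.

```python
from math import dist  # Euclidean distance, Python 3.8+

CUTOFF = 3.5  # cutoff quoted in the abstract; unit assumed to be angstroms


def binding_residues(protein_atoms, rna_atoms, cutoff=CUTOFF):
    """Return the ids of residues with at least one atom closer than
    `cutoff` to any RNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs, one per atom
    rna_atoms:     iterable of (x, y, z) coordinates
    """
    rna = list(rna_atoms)
    binding = set()
    for res_id, xyz in protein_atoms:
        if res_id in binding:
            continue  # residue already labelled, skip its remaining atoms
        if any(dist(xyz, r) < cutoff for r in rna):
            binding.add(res_id)
    return binding


# Toy coordinates (hypothetical, for illustration only).
protein = [
    ("A:ARG10", (0.0, 0.0, 0.0)),   # 4.0 from the RNA atom: too far
    ("A:ARG10", (1.5, 0.0, 0.0)),   # 2.5 from the RNA atom: within cutoff
    ("A:CYS11", (10.0, 0.0, 0.0)),  # 6.0 from the RNA atom: too far
]
rna = [(4.0, 0.0, 0.0)]

print(sorted(binding_residues(protein, rna)))  # ['A:ARG10']
```

The all-pairs loop is quadratic in the number of atoms; for whole complexes a spatial index would be the usual choice, but the labelling rule itself stays the same.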
【Abstract】 Protein-RNA interactions play an important role in many biological processes, such as RNA splicing, translation, protein synthesis and post-transcriptional regulation. Identifying the RNA-binding residues in proteins therefore provides valuable information for understanding the mechanisms of protein-RNA interaction.

Current approaches to studying protein-RNA interaction can be divided into experimental methods and bioinformatics methods. Experimental methods, such as X-ray crystallography or nuclear magnetic resonance, can be used to determine the three-dimensional structure of a protein-RNA complex, from which the RNA-binding residues can be identified. The advantage of experimental methods is that the results are reliable. However, obtaining the crystal structure of a protein-RNA complex is time-consuming, and for some complexes it is difficult to obtain a crystal structure at all.

With the growth of structural data on protein-RNA interaction, researchers have been trying to find RNA-binding residues with bioinformatics methods, which fall into three main categories: structural domain methods, molecular dynamics simulations and machine learning methods. The core idea of the structural domain methods is to find RNA-binding residues by locating the RNA-binding domain through protein structure databases such as SCOP. However, this approach can only be used for proteins whose RNA-binding domain has been determined. In addition, the mechanism of RNA-binding domains is not yet fully understood: residues in an RNA-binding domain sometimes interact with regions of the RNA other than the target region, or even with other proteins. Another way of finding RNA-binding residues is molecular dynamics simulation, which makes it possible to observe the whole binding process and to follow the changes in energy and conformation during it. The first drawback of simulation methods is that they are time-consuming and feasible only for small systems.
The second is that the correctness of a simulation depends on the parameter settings, and it is sometimes very difficult to find optimal parameters. With the accumulation of a large amount of structural data, however, it has become possible to find RNA-binding residues by machine learning, and several models for predicting RNA-binding residues have been proposed recently. After analyzing these models in detail, we found the following shortcomings. First, the number of training samples is small, which may lead to biased results. Second, the number of features is small: some studies considered only a few features and may have missed important variables. Third, some models rely on features extracted from 3D structure data, or on complex physicochemical features, and so cannot be applied to protein sequences without 3D structures.

To solve these problems, we need a prediction model with the following characteristics: 1. the model should be built on a large dataset to avoid bias; 2. more features should be extracted in order to improve prediction performance; and 3. the features used to build the model should be derived from sequence information only. We developed our models to meet these requirements.

First, we extracted 532 protein-RNA complexes from the PDB database released before June 2011. These complexes were determined by X-ray crystallography with a resolution greater than 3 Å and contain only protein and RNA sequences. After removing 90 samples that have an RNA chain shorter than 4 nucleotides or errors in the sequence data, we obtained a dataset of 442 samples containing 1970 protein sequences and 823 RNA sequences. To reduce data redundancy, the protein sequences were clustered into 429 groups by BLASTClust at a sequence identity threshold of 25%. The first sequence of each group was selected as the representative of that group.
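The representative-selection step can be sketched as below. This is an illustration only, not the author's pipeline: it assumes the cluster file lists one cluster per line with whitespace-separated sequence identifiers, and the function name `cluster_representatives` and the identifiers in the example are hypothetical.

```python
def cluster_representatives(cluster_lines):
    """Pick the first sequence id of each cluster.

    cluster_lines: iterable of strings, one cluster per line, with
    whitespace-separated sequence ids (assumed file layout).
    """
    representatives = []
    for line in cluster_lines:
        ids = line.split()
        if ids:  # skip blank lines
            representatives.append(ids[0])
    return representatives


# Hypothetical cluster listing: two clusters and one blank line.
example = ["1ABC_A 2DEF_B 3GHI_A", "", "4JKL_C"]
print(cluster_representatives(example))  # ['1ABC_A', '4JKL_C']
```

Taking the first member is the simplest rule and matches the one stated in the abstract; which sequence comes first depends on how the clustering tool orders each cluster.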
After that, we obtained 429 non-redundant protein sequences containing 90735 amino acid residues.

Binding sites are defined by the distance between atoms: if any atom of an amino acid residue falls within a cutoff distance of 3.5 Å of any atom of the RNA molecule in the complex, the residue is designated a binding site. In the dataset of 90735 amino acid residues, we found 10525 binding residues and 80210 non-binding residues.

After defining the binding sites, each amino acid residue is characterized by nine classes of features: ① the number of atoms; ② the electrostatic charge; ③ the number of potential hydrogen bonds; ④ side-chain pKa value; ⑤ hydrophobicity index; ⑥ relative accessible surface area; ⑦ secondary structure; ⑧ smoothed PSSM; ⑨ classification of amino acids based on dipole moment and side-chain volume.

Finally, we applied the TClass program to select features and construct the prediction model by combining the Naïve Bayes classification method with a forward feature selection strategy. In addition, the attribute bagging method was used to improve classifier performance. A test on an independent dataset shows that the classifier achieves 83.86% overall accuracy, with 83.32% sensitivity and 80.55% specificity. A case study of the Xlrbpa protein shows good overlap between the positions predicted by our model and those determined by the RNA-binding domain.

By analyzing the relationship between the propensities of amino acid usage and the features, we obtained the following results: ① RNA shows a strong bias in amino acid selection: the number of occurrences of the most popular amino acid is 38 times that of the least popular one. ② Hydrophilic amino acids are more popular than hydrophobic amino acids, with 4.38 times as many occurrences. ③ Positively charged polar amino acids are more popular than non-polar amino acids. ④ Amino acid residues whose dipole moment is greater than 3.0 debye and whose side-chain volume is greater than 50 Å³ are more popular with nucleotides.
Amino acids whose dipole moment is greater than 3.0 debye but has the opposite orientation are unpopular with nucleotides.

Based on the prediction model, we built an online prediction server called RBRPre, powered by MATLAB Builder JA, MySQL and JSP. Users can visit the website and submit a protein sequence; the prediction result is then sent to the user by email. To avoid crashes caused by highly concurrent visits, we implemented a queue scheduling algorithm with MySQL and crontab.

In summary, based on a large dataset of protein-RNA complexes and a large number of features, we developed an RNA-binding residue prediction model and analyzed the relationship between the propensities of amino acid usage and the features. Test results on an independent dataset and the case study of the Xlrbpa protein show that the prediction model performs well.

This work yields the following results: ① It makes it possible to identify RNA-binding residues from sequence information alone. ② It provides valuable information for understanding the mechanism of protein-RNA interaction through analysis of the relationship between the propensities of amino acid usage and the features. ③ By constructing the online prediction server RBRPre, it provides better bioinformatics support for finding RNA-binding sites in proteins and speeds up related experiments.

The innovations of this work are: ① With this model, researchers can obtain the RNA-binding residues in a protein from sequence information alone. The online prediction tool, RBRPre, provides an easy-to-use service for researchers. ② Based on a large dataset and many features, a reliable result without bias can be obtained. ③ The bias of amino acid selection at RNA-binding sites is analyzed, as is the relationship between amino acid features and RNA-binding bias.
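The accuracy, sensitivity and specificity quoted in the abstract follow the standard confusion-matrix definitions, sketched below. The function name `classification_metrics` is hypothetical. The example applies it to the full-dataset counts from the abstract (10525 binding and 80210 non-binding residues) for an imaginary predictor that labels every residue as non-binding; the composition of the independent test set is not given in the abstract, so this is only an illustration of why the three figures are reported together.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # sensitivity: fraction of true binding residues that are recovered
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # specificity: fraction of non-binding residues correctly rejected
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


BINDING = 10525      # binding residues in the dataset (from the abstract)
NON_BINDING = 80210  # non-binding residues in the dataset (from the abstract)

# Imaginary baseline that predicts "non-binding" for every residue.
baseline = classification_metrics(tp=0, fp=0, tn=NON_BINDING, fn=BINDING)
print(f"accuracy={baseline['accuracy']:.4f} "
      f"sensitivity={baseline['sensitivity']:.1f} "
      f"specificity={baseline['specificity']:.1f}")
# accuracy=0.8840 sensitivity=0.0 specificity=1.0
```

Because only about 11.6% of the residues are binding sites, accuracy alone can look high for a predictor that finds no binding sites, so sensitivity and specificity are the more informative pair on data with this class balance.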
【Key words】 Protein-RNA interaction; binding site; prediction model; bias;
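- 【Online publication contributor】 Academy of Military Medical Sciences of the Chinese People's Liberation Army 【Online publication year and issue】 2012, Issue 10
- 【Classification code】 Q75
- 【Download count】 343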