拟南芥AtERFs家族DNA结合特性计算分析及其亚家族DREBs调节靶基因的预测
Computational Analysis of the Specificity of DNA Recognition by AtERFs and in Silico Identification of the Target Gene Candidates of DREBs in Arabidopsis Genome
【Author】 Wang Shichen (汪世臣);
【Supervisor】 Hao Dongyun (郝东云);
【Author information】 Jilin University, Biochemistry and Molecular Biology, 2008, PhD
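【Abstract (translated from Chinese)】 Arabidopsis ethylene responsive element binding factors (AtERFs) are a class of plant-specific transcription factors. Numerous studies have shown that AtERF transcription factors play important roles in plant growth and development, especially in regulating responses to adverse growth conditions. The AtERF family belongs to the AP2/ERF superfamily, and 124 AtERF transcription factors have so far been identified in the Arabidopsis genome. Through their conserved DNA-binding domain, AtERFs specifically recognize a class of cis-regulatory elements, such as the GCC-box, and thereby regulate the transcription of specific genes. To dissect the interactions between AtERF proteins and DNA motifs, and to understand the similarities and differences in DNA-binding properties among the AtERF subfamilies, this study selected one AtERF protein from each of four subfamilies (AtERF1, AtEBP, CBF1 and AtERF4). The three-dimensional conformations of the DNA-binding domains of AtEBP, CBF1 and AtERF4 were obtained by homology modeling, and molecular dynamics simulations were used to compare how the four proteins interact with the GCC-box. Based on these interaction characteristics, the structural basis of DNA motif recognition by the AtERF family was analyzed. The results provide detailed information on the sequence-specific recognition of DNA by AtERF proteins. By comparing the DNA-motif binding properties of the representative proteins from the four subfamilies, the key amino acid residues and the key positions in the DNA motif were identified, which offers useful information for explaining the functional divergence of AtERF proteins.

Identifying the target genes regulated by transcription factors is an important step in constructing transcriptional regulatory networks and an important clue for understanding how cells respond to their environment. The DNA-binding domain of a transcription factor determines its specific regulation of genes: it allows the factor to find specific DNA motifs in the upstream regulatory regions of target genes. In theory, for any species with a fully sequenced genome, the target genes of a transcription factor can be identified directly by searching for its specific DNA motifs. Determining the target genes of a transcription factor experimentally, however, remains laborious, especially in complex genomes. By analyzing transcription factor binding sites and introducing advanced methods from machine learning, this study proposes an efficient and highly specific computational method for predicting transcription factor target genes. The target genes regulated by DREB transcription factors were also predicted genome-wide, and the predictions provide a valuable reference for further dissecting the molecular network of plant stress-resistance regulation in which DREBs participate.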
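The idea of locating target genes by searching upstream regions for a transcription factor's DNA motif can be illustrated with a minimal sketch. It assumes the GCC-box core is the consensus GCCGCC given in the English abstract and scans both strands of a sequence. The function names `reverse_complement` and `find_motif` are hypothetical and are not taken from the thesis.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.translate(complement)[::-1]


def find_motif(seq, motif="GCCGCC"):
    """List (position, strand) for every exact motif hit on either strand.

    Positions are 0-based offsets on the forward strand. A palindromic
    motif would be reported once per strand at the same position.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            # Step by one base so overlapping occurrences are not missed.
            start = seq.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical upstream sequence, for illustration only.
    print(find_motif("AAGCCGCCTTGGCGGCAA"))
```

Exact matching of this kind finds every occurrence of the motif, including non-functional ones, which is the limitation that motivates the context-based classification described in the abstract.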
【Abstract】 Arabidopsis ethylene responsive element binding factors (AtERFs) form a transcription factor superfamily. While the functions of most AtERFs are unknown, a number of AtERFs are reported to play an essential role in regulating stress-related genes by binding, through their DNA-binding domains (ERF domains), to a consensus motif, the GCC-box, in the regulatory region. Phylogenetic analysis of the ERF domains led to a classification of the AtERF superfamily into four predominant subfamilies.

In the first section of this thesis, computational analysis of the structural properties of AtERF-DNA motif complexes was performed. We selected four AtERF proteins, AtERF1, AtEBP, CBF1 and AtERF4, as representatives of the four subfamilies and constructed four AtERF DNA-binding domain (DBD)-DNA complexes through homology modeling. Molecular dynamics simulations were then performed to explore the interactions between the six conserved residues and the DNA motif, the GCC-box. By comparing these interactions among the four AtERF DBD-DNA complexes, we revealed the common properties of protein-DNA interactions among the AtERFs and the differential roles of each base of the GCC-box in specific recognition by AtERFs. Our results suggest that three amino acid residues, Arg29, Glu39 and Arg41, play a vital role in direct readout of DNA. The positions of the consensus sequence GCCGCC differ intrinsically in how they bind ERF domains. The CGNC element in the GCC motif may be required for recognition by ERF domains. Our results provide structural evidence for the sequence-dependent recognition mechanism of AtERFs.

The identification of downstream target genes of specific transcription factors (TFs) is necessary for understanding cellular responses to environmental stimuli. Most known gene regulatory networks are highly complicated because they involve cooperative interactions and feedback regulation.
The discovery of the direct targets of transcription factors is a fundamental step in elucidating how regulatory networks are constructed. The availability of genome sequences has made it possible to discover the target genes of a specific transcription factor by locating its specific recognition motifs in the genome. In practice, however, the task is still difficult because of the complexity of plant genomes. During the last decade, many computational methods have been developed that successfully identify the target genes of transcription factors. Among these methods, the position weight matrix (PWM) has been the technique most widely used to describe transcription factor binding sites (TFBSs) and to scan for them at genome scale. However, because TFBSs are only loosely conserved, these strategies could not identify TFBSs effectively at genome scale. For this reason, approaches that combine the PWM with analysis of the TFBS context were developed to overcome this shortcoming. In essence, all of these approaches develop algorithms that describe the properties of TFBSs and their contexts.

In the second section of this thesis, we report a novel computational strategy to determine the DREB transcription factor binding sites in the Arabidopsis genome by combining context analysis of the TFBS with a machine learning approach.

Dehydration responsive element binding proteins (DREBs) are important transcription factors that induce the expression of a series of abiotic stress-related genes and impart stress endurance to plants. They belong to the ethylene responsive element binding factor (AP2-EREBP) superfamily of 124 members (the so-called ERF proteins), 57 of which are in the DREB subfamily.
The ERF proteins share a conserved DNA-binding domain (ERF domain) of 58–60 amino acids that reportedly binds two typical cis-acting elements, the GCC-box and the C-repeat (CRT)/dehydration responsive element (DRE) motif, and is involved in the expression of cold- and dehydration-responsive genes. Because the DREBs play a vital role in various types of biotic and abiotic stress responses, it is important to identify their target genes in Arabidopsis. Maruyama et al. identified the downstream genes of DREB1A/CBF3 using two microarray systems. Fowler and Thomashow, and Taji et al., also reported downstream genes of DREB proteins. Nevertheless, the full set of DREB target genes is yet to be discovered.

We focused on the differences between DRE frame sequences (DFSs; DNA fragments of 206 bp, retrieved from the PPRs of MGs, containing a DRE motif (A/GCCGAC) at their center) and non-DRE frame sequences (nDFSs; DNA fragments of 206 bp, collected randomly from the PPRs of the Arabidopsis genome, with a DRE motif inserted artificially at their center). A machine learning approach, specifically a support vector machine (SVM) based classifier, was developed to categorize DRE-containing sequences into DFSs and nDFSs. Our results suggest that this algorithm is effective in discovering DREB binding sites in the promoter regions of target genes, and thus in inferring the target genes of DREBs in Arabidopsis. Furthermore, we predicted 474 candidate genes as direct targets of DREBs. With reference to the AtGenExpress microarray data, we identified 268 direct targets of DREBs that were inducible by abiotic stress stimuli such as cold, salinity and drought during a 24-hour observation. The results of this study provide primary information that warrants further experimental investigation of the anti-stress regulatory network of DREBs in plants.
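The construction of 206 bp frames centered on a DRE motif can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation. It assumes the 6 bp motif A/GCCGAC is flanked by 100 bp on each side (which gives 206 bp, although the abstract does not state the flank length), it scans the forward strand only, and it uses dinucleotide frequencies as a stand-in feature set, since the abstract does not specify the features given to the SVM. The names `dre_frames` and `kmer_features` are hypothetical.

```python
import re
from collections import Counter
from itertools import product

# Degenerate DRE core A/GCCGAC; the lookahead lets overlapping hits be found.
DRE_PATTERN = re.compile(r"(?=([AG]CCGAC))")
MOTIF_LEN = 6
FLANK = 100  # assumed flank length: 100 + 6 + 100 = 206 bp


def dre_frames(promoter):
    """Return every 206 bp window that has a DRE motif at its center."""
    promoter = promoter.upper()
    frames = []
    for match in DRE_PATTERN.finditer(promoter):
        start = match.start() - FLANK
        end = match.start() + MOTIF_LEN + FLANK
        # Skip motifs too close to either end to have full flanks.
        if start >= 0 and end <= len(promoter):
            frames.append(promoter[start:end])
    return frames


def kmer_features(frame, k=2):
    """Encode a frame as k-mer frequencies in a fixed alphabetical order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_windows = max(len(frame) - k + 1, 1)
    counts = Counter(frame[i:i + k] for i in range(len(frame) - k + 1))
    return [counts[kmer] / n_windows for kmer in kmers]


if __name__ == "__main__":
    # Hypothetical promoter sequence, for illustration only.
    promoter = "T" * 100 + "ACCGAC" + "T" * 100
    for frame in dre_frames(promoter):
        print(len(frame), kmer_features(frame)[:4])
```

Feature vectors of this kind, labeled as DFS or nDFS, are the sort of input a standard SVM implementation could be trained on. The choice of features and kernel would have to follow the thesis itself.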
【Key words】 transcription factors; DNA binding; molecular dynamics; target gene; machine learning; Arabidopsis;