Regulation and Polymorphism Analysis on Gene-associated SSRs in Plants

【Author】 Zhang Lida (张利达)

【Supervisor】 Tang Kexuan (唐克轩)

【Author information】 Shanghai Jiao Tong University, Biomedical Engineering, 2010, PhD

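【Abstract (translated from Chinese)】 Simple sequence repeats (SSRs) are a class of short tandem repeats that are widely distributed in plant genomes. The traditional view held that SSRs tend to occur in repetitive regions of plant genomes. As large numbers of plant genomes and expressed sequences have been sequenced, however, the distribution of SSRs has turned out to be far more complex than previously thought: SSRs are not concentrated in repetitive regions but are preferentially located in single-copy or low-copy regions, and they are especially abundant in the upstream regulatory regions of genes. Because of the one-sided view that SSRs are genomic "junk" DNA without important biological function, their biological roles were long neglected, yet the extraordinary accumulation of SSRs in plant gene regulatory regions should confer an advantage in environmental adaptation. Studying the function of the many SSRs in regulatory regions is therefore important for moving beyond the view of SSRs as merely a good class of molecular markers and for deepening our understanding of their biological functions in plant genomes.

SSRs in the regulatory regions of the Arabidopsis genome consist mainly of GA/CT and GAA/CTT repeat types, which account for about 60% of all SSR loci in these regions. Some SSR loci act as important functional elements in gene regulation, and such functional loci should show some degree of sequence conservation during evolution. This study used phylogenetic footprinting to analyze the conservation of CT/GA and CTT/GAA SSRs in the regulatory regions of homologous genes, both within the Arabidopsis genome and between the Arabidopsis and Brassica oleracea genomes, in order to discover conserved noncoding SSRs with regulatory function, termed conserved noncoding microsatellite sequences (CNMSs). In total, 247 SSR loci were conserved during the divergence of the Arabidopsis and Brassica oleracea genomes, comprising 182 CT/GA and 65 CTT/GAA repeat loci. Likewise, analysis of Arabidopsis paralogous genes identified 122 CNMS loci in regulatory regions, comprising 78 CT/GA and 44 CTT/GAA repeat loci. Altogether, 491 CT/GA and CTT/GAA repeats in Arabidopsis regulatory regions were conserved, accounting for 10.6% of the SSRs of these types in the 500 bp upstream regulatory regions of the whole Arabidopsis genome. Comparison with three different types of random datasets ruled out the possibility that CNMSs occur in homologous regulatory regions by chance, and further confirmed that some SSRs in the regulatory regions of Arabidopsis homologous genes originated from the same ancestral locus and have been highly conserved during evolution.

To understand the evolutionary origin of Arabidopsis CNMSs, this study estimated the evolutionary relationships of CNMS loci by calculating the synonymous substitution rates of the associated homologous genes. The results indicate that the Arabidopsis-Brassica orthologous CNMS loci originated from the same ancestral locus 15 million years (Myr) ago. Most Arabidopsis paralogous CNMS loci originated from the large-scale duplication of the Arabidopsis genome 28 Myr ago, while a minority originated from the common ancestral sequence of the Brassicaceae 42 Myr ago. From these estimates it was inferred that some ancient Arabidopsis paralogous CNMS loci should have corresponding orthologous loci in the Brassica oleracea genome. Further comparison of the conserved SSRs in Arabidopsis-Brassica and Arabidopsis-Arabidopsis regulatory regions found that 18 Arabidopsis paralogous CNMSs have at least one conserved counterpart in the Brassica oleracea genome. These Ultra-CNMS loci, which occur in the regulatory regions of both Arabidopsis-Brassica orthologs and Arabidopsis paralogs, are also conserved in other, more distantly related plant genomes.

According to Gene Ontology annotation, 206 Arabidopsis-Brassica CNMS-associated genes and 194 Arabidopsis-Arabidopsis CNMS-associated genes have reasonably well-defined functions. Functional analysis showed that the functions of CNMS-associated genes are significantly associated with transcription factor activity and transcriptional regulation. Bioinformatic prediction showed that the conserved CT/GA and CTT/GAA repeats resemble known functional cis-acting elements that respond to light signals and salicylic acid signals, respectively. This study tested the computational prediction by analyzing the expression patterns of Arabidopsis CNMS (CTT)n/(GAA)n-associated genes after salicylic acid treatment. Arabidopsis MPSS expression data showed that the transcript abundance in leaves of about 70%-80% of the Arabidopsis CNMS (CTT)n/(GAA)n-associated genes was clearly regulated by salicylic acid. Semi-quantitative RT-PCR was used to profile seven of these genes, and their expression patterns after salicylic acid treatment were largely consistent with the expression profiles in the Arabidopsis MPSS data.

To further examine the link between CTT/GAA-type SSRs and salicylic acid induction, deletion analysis was used to look for salicylic acid regulatory elements in the promoter of the Arabidopsis AtHip1 gene, which contains a CTT repeat locus and is induced by salicylic acid. Four different deletion constructs were fused to the gus gene to build plant expression vectors, which were transformed into Arabidopsis by the Agrobacterium-mediated method. Assays of GUS reporter activity and quantitative PCR on the transgenic plants showed that the 216 bp region from -399 to -184 of the AtHip1 promoter is the core region for transcriptional regulation by the promoter, and that deleting this region abolished salicylic acid induction. Bioinformatic analysis of this 216 bp region found no functional element involved in the salicylic acid signal response other than the CTT repeat element, which potentially responds to salicylic acid, indicating that the CTT repeat in the regulatory region is a cis-acting element for the salicylic acid signal response.

SSR markers derived from expressed sequence tags (ESTs) not only have all the advantages of traditional genome-derived SSR markers but also reflect differences in the transcribed regions of genes. Their polymorphism may be directly related to the function of the gene in which they lie, which gives them high practical value. With the rapid development of EST sequencing, and especially the repeated sequencing of ESTs of the same gene from different subspecies or cultivars, the EST sequences of many genes within a species contain a great deal of redundant information, and some of these redundant sequences carry information on SSR length polymorphism. On this basis, this study developed a computational method for the large-scale discovery of polymorphic EST-SSR loci. Applying the method to EST sequences in public sequence databases from ten major crops (maize, soybean, rice, wheat, rapeseed, barley, cotton, tomato, potato and sorghum) detected 15,640 SSR loci whose alleles show length polymorphism. Across the ten crops, polymorphic EST-SSR loci made up between 0.7% and 2.61% of the SSR loci examined, with the highest rate in maize and the lowest in tomato. These polymorphic loci were mainly dinucleotide and trinucleotide repeats, which accounted for about 84% of all loci discovered. Analysis of allele length variation showed that alleles in which mutation had increased the number of repeat units clearly outnumbered alleles in which it had decreased, indicating that EST-SSR mutation tends to increase locus length.

Polymorphic EST-SSR loci derive from the transcribed regions of genes and are highly transferable between species. Transferability analysis of the 15,640 length-polymorphic EST-SSRs gave transferability rates between 14.1% and 45.9%. The polymorphic EST-SSR loci of sorghum (45.9%), wheat (39.1%) and barley (38.2%) were the most transferable among related crops, while those of rapeseed (14.1%) were less transferable. The transferability of crop EST-SSR markers shows that, for crops lacking marker resources, developing markers from the EST-SSR markers already available in other crops is an effective alternative.

Using the representative plant GO slims of the Gene Ontology, 14,084 genes with polymorphic EST-SSR loci were functionally classified, and 8,952 gene sequences were associated with 108,601 GO functional annotations. Classified by GO biological process, the plant genes containing polymorphic SSR loci were mainly involved in protein metabolism, transport, transcription, stress response, development and signal transduction. Their molecular functions mainly included protein binding, DNA or RNA binding (including transcription regulator activity and transcription factor activity), hydrolase activity and transferase activity.

To make information on the polymorphic EST-SSR loci easy to query, a database of polymorphic EST-SSR loci was built with the MySQL database management system. The database includes the repeat unit, allele length, cultivar of origin, function of the associated gene and candidate amplification primers. Users can view this information, together with the detailed assembly results for each EST cluster, through a web browser. Users can also use BLAST to run homology comparisons against the EST sequences that contain polymorphic SSR loci. The website also provides a data download service.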
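
The dating of CNMS loci described in the abstract rests on converting a synonymous substitution rate (Ks) into a divergence time. The following is a minimal sketch of that conversion, assuming the standard molecular-clock relation T = Ks / (2λ) and a synonymous substitution rate λ of 1.5 × 10^-8 per site per year, a value often used for Brassicaceae. The abstract does not state which formula or rate was used, so both are assumptions, and the function name is hypothetical.

```python
def divergence_time_myr(ks, rate_per_site_per_year=1.5e-8):
    """Convert a synonymous substitution rate (Ks) into a divergence time in Myr.

    Uses T = Ks / (2 * rate): substitutions accumulate on both lineages
    after the split, hence the factor of 2. The default rate is an
    assumption, not a value taken from the thesis.
    """
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


# Illustrative Ks values only; they are not results reported in the abstract.
for ks in (0.45, 0.84, 1.26):
    print(f"Ks = {ks:.2f} -> about {divergence_time_myr(ks):.0f} Myr")
```

Under the assumed rate, a Ks of 0.45 corresponds to roughly 15 Myr. A different rate rescales every estimate proportionally, so dates obtained this way are only as reliable as the assumed clock.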

【Abstract】 Simple sequence repeats (SSRs), a class of short tandem repeats, are extremely common in plant genomes. SSRs were long thought to originate from repetitive genomic DNA and were regarded as "junk" DNA without any apparent function. With the advance of genome sequencing, recent investigations have shown that SSRs are preferentially associated with nonrepetitive DNA in plant genomes. They are abundant within or near plant genes, and some types in particular are significantly enriched in 5' regulatory regions. This implies that SSRs within regulatory regions may play vital roles in gene expression or function in plants. Investigating these over-represented SSRs will therefore help to clarify their function in plant gene regulation.

SSRs are significantly enriched in the regulatory regions of the Arabidopsis genome, and this feature is mostly attributable to the over-representation of CT/GA and CTT/GAA repeats, which account for about 60% of all SSRs in these regions. If these SSRs are important for regulating gene expression, they should be conserved in homologous promoters because of functional constraints during plant evolution. To address whether SSRs are associated with gene regulation, we used inter- and intra-genomic phylogenetic footprinting to analyze the dominant SSRs in the 5' noncoding regions of Arabidopsis and Brassica oleracea genes for conserved noncoding SSRs, or conserved noncoding microsatellite sequences (CNMSs). We identified 247 Arabidopsis-Brassica orthologous and 122 Arabidopsis paralogous CNMSs, representing 491 CT/GA and CTT/GAA repeats, which accounted for 10.6% of the repeats of these types located in the 500 bp regions upstream of coding sequences in the Arabidopsis genome.
To ensure that the observation of CNMSs was not simply due to the over-representation of these repeats in plant genomes, a similar analysis was carried out on three different random datasets. It indicated that some SSRs in regulatory regions have been conserved from common ancestors during plant evolution.

To gain further insight into the evolutionary relationships of Arabidopsis-Brassica and Arabidopsis-Arabidopsis CNMSs, the synonymous substitution rate (Ks) was calculated for the corresponding gene pairs. The frequency distribution of Ks suggested that the Arabidopsis-Brassica orthologous CNMSs have been conserved from a common ancestor over a period of 15 million years (Myr), while most of the paralogous CNMSs originated from a large-scale gene duplication over 28 Myr ago and the others were duplicated from the common ancestor of the Brassicaceae family over 42 Myr ago. These evolutionary relationships suggested that most paralogous CNMSs pre-dated the divergence of the two species. Further comparisons of paralogous and orthologous genes from Arabidopsis and Brassica were made to find common CNMSs. Using the same criteria, we identified 18 Ultra-CNMSs in Arabidopsis paralogous pairs that also coincided with CNMSs from at least one ortholog in Brassica, and many Ultra-CNMSs were conserved across a number of more distantly related homologous genes in Brassicaceae species and other plants.

Functional annotation based on Gene Ontology showed that 206 Arabidopsis–Brassica and 194 Arabidopsis–Arabidopsis CNMS-associated genes had known functions, and that these functions were significantly enriched for transcription factor activity and transcriptional regulation. These findings suggested that CNMSs might be specifically associated with the regulation of transcription.
Computational prediction of cis-acting elements revealed that CNMS (CT)n/(GA)n repeats were similar to a known motif involved in light responsiveness and that CNMS (CTT)n/(GAA)n repeats were similar to a known motif involved in salicylic acid responsiveness. Transcript abundance measured by MPSS showed that about 70%-80% of the CNMS (CTT)n/(GAA)n-associated genes in Arabidopsis leaves were regulated by salicylic acid. Seven CNMS (CTT)n/(GAA)n-associated genes were additionally analyzed by RT-PCR for their expression patterns after salicylic acid treatment. The expression of these genes was consistent with the expression patterns in the Arabidopsis MPSS database.

To validate the CTT/GAA repeats as salicylic acid-responsive elements, four 5' deletions of the salicylic acid-induced, CTT repeat-containing AtHip1 promoter were fused to the β-glucuronidase (GUS) gene and introduced into Arabidopsis plants. Histochemical assays of GUS activity and real-time PCR measurement of gus expression in the transformed plants revealed that the region of the AtHip1 promoter from -399 to -184 (216 bp) relative to the transcription start site is the core region for transcriptional regulation. Deletion of this region abolished the salicylic acid responsiveness of the AtHip1 promoter. Bioinformatic analysis found no known salicylic acid-responsive element in this region other than the CTT repeat. Taken together, these results demonstrate that CTT tandem repeats within 5' regions act as cis-acting elements and play important roles in salicylic acid regulation.

Expressed sequence tag (EST)-derived SSRs are genetic markers that are specifically associated with gene expression and function. The large number of ESTs in databases is a valuable resource for developing SSR markers. EST databases may contain redundant sequences of a particular gene, such as different alleles derived from heterozygous individuals or from different genotypes.
Some redundant ESTs carry information on SSR length polymorphisms. We developed an in silico tool that identifies polymorphic SSRs from EST sequence redundancy. Using this tool, we identified 15,640 polymorphic EST-derived SSRs from maize, soybean, rice, wheat, rape, barley, cotton, tomato, potato and sorghum. The percentage of polymorphic SSRs ranged from 0.7% for tomato to 2.61% for maize. The polymorphic EST-SSRs consisted mainly of dinucleotide and trinucleotide repeats, which accounted for 84% of all those identified. The length polymorphism of the 15,640 identified EST-SSRs revealed a mutational bias: alleles tend to increase in size.

EST-SSRs are derived from transcripts. Homology analysis of the 15,640 identified polymorphic EST-SSRs indicated that the in silico EST-SSRs had a high level of transferability across crop species. The percentage of transferability ranged from 14.1% for rape to 45.9% for sorghum, and the grass species sorghum (45.9%), wheat (39.1%) and barley (38.2%) showed the highest values. Large-scale in silico identification of polymorphic EST-SSRs greatly improves the efficiency of marker development. For plants with few marker resources, it is practicable to develop new molecular markers by exploiting EST-SSR transferability.

Each unique EST with a polymorphic SSR was searched against the UniProt/Swiss-Prot database by BLAST, and the assigned UniProt/Swiss-Prot IDs were classified into categories according to GO terms using the plant GO slims. The results showed that 8,952 of the 14,084 unique ESTs were associated with 108,601 GO annotations.
The functional categories revealed that ESTs with polymorphic SSRs were mainly involved in biological processes such as protein metabolism, transport, transcription, response to stress, developmental processes and signal transduction, while their molecular functions were preferentially associated with protein binding, DNA or RNA binding, hydrolase activity and transferase activity.

To facilitate access to this resource of polymorphic EST-SSRs from crops, we developed a database providing detailed information on these EST-SSRs, such as SSR motif, allele length, cultivar, gene function and primers. The database also provides a view of the EST assemblies and a BLAST-based homology analysis of SSR-containing ESTs among related species. The online EST-SSR database service was implemented in Perl and MySQL, and the data are available for download.
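
The idea of mining SSR length polymorphisms from redundant ESTs can be illustrated with a small sketch. The abstract does not describe the tool's algorithm, so everything below is an assumption made for illustration: ESTs are taken to be already clustered by gene, only perfect dinucleotide and trinucleotide repeats are considered, the minimum repeat counts (6 and 4) and the 8 bp flank used to match the same locus across ESTs are arbitrary, and the names `find_ssrs` and `polymorphic_loci` are hypothetical.

```python
import re
from collections import defaultdict

# Minimum number of repeat units per motif length (illustrative thresholds).
MIN_REPEATS = {2: 6, 3: 4}


def find_ssrs(seq):
    """Return (start, motif, repeat_count) for perfect di-/trinucleotide SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAA
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


def polymorphic_loci(cluster, flank=8):
    """Find SSR loci whose repeat count differs among ESTs of one cluster.

    cluster maps EST id -> sequence; all ESTs are assumed to come from the
    same gene. A locus is identified by its motif plus the flanking bases
    on each side, so alleles of different length still share one key.
    """
    alleles = defaultdict(set)
    for seq in cluster.values():
        seq = seq.upper()
        for start, motif, count in find_ssrs(seq):
            end = start + count * len(motif)
            left = seq[max(0, start - flank):start]
            right = seq[end:end + flank]
            if len(left) == flank and len(right) == flank:
                alleles[(left, motif, right)].add(count)
    return {locus: sorted(counts)
            for locus, counts in alleles.items() if len(counts) > 1}


# Toy cluster: three ESTs of one hypothetical gene, two alleles of a CT repeat.
example_cluster = {
    "est1": "ACGTACGTAG" + "CT" * 7 + "GGATCCAATG",
    "est2": "ACGTACGTAG" + "CT" * 9 + "GGATCCAATG",
    "est3": "ACGTACGTAG" + "CT" * 7 + "GGATCCAATG",
}
print(polymorphic_loci(example_cluster))
```

In the toy cluster the CT locus is reported with two alleles, 7 and 9 repeats. A real pipeline would also need EST assembly, tolerance for sequencing errors in the flanks, and handling of reverse-complement reads, none of which is attempted here.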
