基于生物信息学方法分析基因家族及非编码序列的研究

Study of Gene Families and Non-coding Sequences Based on Bioinformatics Methods

【Author】 Wang Xusheng (汪旭升)

【Supervisor】 Zhu Jun (朱军)

【Author information】 Zhejiang University, Crop Genetics and Breeding, 2006, PhD

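【Abstract (Chinese version)】 Rice (Oryza sativa L.) is one of the world's most important food crops and provides the staple food for more than half of the world's population. Rice comprises two subspecies, indica and japonica. Draft genome sequences of two cultivars representing the two subspecies, 93-11 and Nipponbare, have been released, and finished sequences of Nipponbare chromosomes 1, 4 and 10 have also been completed. Arabidopsis has a relatively small genome and is a model plant for plant genetics; in the second half of 2000 it became the first plant whose whole genome was sequenced. Starting from two important breeding objectives, disease resistance and drought resistance, this thesis analyzes the distribution, expression and regulation of the genes in two rice gene families related to disease resistance and drought resistance. Based on the conservation of the structural and functional domains of resistance genes, we improved resistance gene analog (RGA) polymorphism markers. In addition, because intron sequences vary greatly whereas exon sequences vary little, we developed intron length polymorphism markers. We also studied long non-coding sequences, short non-coding sequences and special sequences in order to understand the regulation of plant gene expression. Using bioinformatics methods and combining rice and Arabidopsis gene sequences, this thesis investigates in depth several current research questions in crop genetics and breeding. The main results are as follows.

(1) Water-stress gene family: the LEA protein gene family. We identified 34 homologous sequences of LEA proteins, 25 of which are new or related genes identified in this work. By alignment with full-length cDNAs, four OsLEA genes were found to undergo alternative splicing. LEA genes are distributed on all rice chromosomes except chromosomes 10 and 12. We also found two independent conversion events. Expression analysis of 15 OsLEA genes by RT-PCR showed that OsLEA expression is highly diverse: some genes are constitutively expressed and others are regulated by stress. In the LEA genes induced by ABA and by drought, we found the elements CACAGTA and CACGCACG.

(2) Disease resistance (R) gene family. Using the sequences of 45 plant resistance genes of known function to search the whole genome sequence of japonica rice, we found 2,119 R gene homologs or analogs (RGAs) and showed that RGAs occur in clusters and are non-randomly distributed in the rice genome. Using a hidden Markov model (HMM), these RGAs were classified into 21 classes according to their functional domains. Comparing the japonica RGAs with the indica genome sequence, we found 702 RGAs that are allelic between the two subspecies, of which 671 (95.6%) show length differences (InDels) in their genomic sequences (including coding and non-coding regions) between the two subspecies, indicating that rice RGAs are highly polymorphic between the two subspecies. By designing primers flanking the InDels and validating them with e-PCR, we developed 402 PCR-based, co-dominant candidate RGA markers. The length differences of these candidate markers between the two subspecies range from 1 to 742 bp, with an average of 10.26 bp. We further carried out an evolutionary analysis of all 182 resistance gene clusters.

(3) Non-coding sequences: intron length polymorphism (ILP) molecular markers. Using the draft genome sequences of two rice cultivars, 93-11 (indica) and Nipponbare (japonica), and 32,127 full-length cDNA sequences of Nipponbare, we searched the whole genome for ILPs and found 13,308 candidate ILPs. Based on these candidates, we used electronic PCR (e-PCR) with primers designed in the flanking exons and developed 5,811 candidate ILP markers.

(4) Long non-coding sequences: conserved non-coding elements (pCNE). Using an orthology-based approach across monocots and dicots, we found 436 pCNEs. By searching for paralogous CNEs, we found 7,972 pCNEs in Arabidopsis. We hypothesized that functionally specific proteins are associated with their corresponding paralogous and orthologous CNEs, and found that CNEs tend to act together with transcription factors. The enriched transcription factors are mainly myb transcription factors and zinc finger proteins.

(5) Short conserved non-coding sequences: transcription factor binding sites. Intergenic sequences contain many regulatory sequences, chief among them transcription factor binding sites. Using Pearson correlation, we found 787 co-expressed, tissue-specific genes in four different tissues. Using stepwise regression, we identified significant transcription factor binding sites in the upstream promoter sequence of each gene. We systematically analyzed the individual and combined binding sites that control gene transcription and expression. The types of transcription factor binding sites differ among tissues; for example, pollen has 62 and root has 69.

(6) Special sequences: methylation site prediction. Gene expression is regulated not only by transcription factors but also by DNA methylation and other kinds of regulation and modification. DNA methylation is involved in many biological processes, including tissue-specific gene expression and genomic imprinting. We describe a computational prediction of methylation in the Arabidopsis genome. We used different discriminant methods to analyze methylated and unmethylated regions. The results show that, based on experimentally verified methylation data, the logistic model tree (LMT) classifier achieves a prediction accuracy of 71.03%.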
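The search for conserved elements upstream of the ABA-induced and drought-induced LEA genes can be illustrated with a minimal Python sketch. The abstract does not describe how the search was carried out, so this is not the procedure used in the thesis: it simply counts a motif on both strands and reports the fraction of sequences that contain it. The function names and the toy sequences are hypothetical; only the motif CACGCACG comes from the abstract.

```python
from typing import Iterable

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_motif(seq: str, motif: str) -> int:
    """Count windows of seq that match motif on either strand.

    Overlapping matches are counted. A palindromic motif is counted
    once per position, because both strands give the same window.
    """
    seq = seq.upper()
    motif = motif.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in targets)


def fraction_with_motif(upstream_seqs: Iterable[str], motif: str) -> float:
    """Fraction of upstream sequences containing the motif at least once."""
    seqs = list(upstream_seqs)
    if not seqs:
        return 0.0
    hits = sum(1 for s in seqs if count_motif(s, motif) > 0)
    return hits / len(seqs)


if __name__ == "__main__":
    # Hypothetical toy data: real input would be 1,000 bp upstream regions.
    induced = ["AACACGCACGTT", "AACGTGCGTGTT", "AAAAAAAAAAAA"]
    background = ["AAAAAAAAAAAA", "TTTTTTTTTTTT", "AACACGCACGTT", "GGGGGGGGGGGG"]
    print(fraction_with_motif(induced, "CACGCACG"))
    print(fraction_with_motif(background, "CACGCACG"))
```

Comparing the fraction in the induced set with the fraction in a background set gives a rough indication of overrepresentation; a real analysis would add a statistical test and a properly chosen background.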

【Abstract】 Rice (Oryza sativa L.) is one of the most important cereal crops in the world and feeds more than half of the world's population. It can be categorized into two main subspecies, japonica and indica. Rice is a model plant for cereal species because of its relatively small genome size. Draft genome sequences of two rice cultivars, 93-11 and Nipponbare, representing the indica and japonica subspecies, respectively, have been released, and complete sequences of chromosomes 1, 4 and 10 of Nipponbare have been published. Arabidopsis is of no major agronomic significance, but it offers important advantages for research in genetics and molecular biology because of its small genome, short generation time, abundant seed production and ease of transformation by simple techniques. Arabidopsis was the first plant whose complete genome was sequenced; the sequence was published in late 2000. A gene family is a set of genes defined by presumed homology, i.e. evidence that the genes evolved from a common ancestral gene.

The purpose of this study is to improve crop breeding and genetics, with drought stress and disease resistance as the main research areas. We analyzed the drought-stress (LEA) and R gene families in rice. We also improved resistance gene analog (RGA) polymorphism markers and developed intron length polymorphism markers. In addition, long and short non-coding sequences were studied. Using bioinformatics methods and a combination of rice and Arabidopsis gene sequences, we conducted the following studies.

(1) LEA gene family. A total of 34 rice LEA (OsLEA) genes were identified, of which 25 were new. We also identified four OsLEA genes with alternative splicing by alignment with full-length cDNAs. The OsLEA genes are distributed on all rice chromosomes except chromosomes 10 and 12. Two independent conversion events were observed. Microarray analysis indicated that most OsLEA genes are regulated by different stress treatments. Expression analysis of 15 OsLEA genes by semiquantitative reverse transcription PCR (sqRT-PCR) revealed that the expression of OsLEA genes is very diverse: some genes are constitutive, some are regulated and some appear to be related to stress tolerance. When we searched for conserved DNA elements in the 1,000 bp upstream regions of the ABA-induced and drought-induced LEA genes, the motifs CACGTA and CACGCACG were clearly overrepresented.

(2) R gene family and RGA markers. By scanning the whole genomic sequence of japonica rice using 45 known plant disease resistance (R) genes, we identified 2,119 resistance gene homologs or analogs (RGAs) and verified that RGAs are not randomly distributed but tend to cluster in the rice genome. The RGAs were classified into 21 families according to their functional domains using a hidden Markov model (HMM). By comparing the RGAs of japonica rice with the whole genomic sequence of indica rice, we found 702 RGAs that are allelic between the two subspecies and revealed that 671 (95.6%) of them have length differences (InDels) in their genomic sequences (including coding and non-coding regions) between the two subspecies, suggesting that RGAs are highly polymorphic loci between the two subspecies. We also developed 402 PCR-based, co-dominant candidate RGA markers by designing primer pairs in the regions flanking the InDels and validating them via e-PCR. The length differences of the candidate RGA markers between the two subspecies range from 1 to 742 bp, with an average of 10.26 bp.

(3) Intron length polymorphism (ILP) markers. We performed a genome-wide search for ILPs between the two rice subspecies (indica and japonica) using the draft genomic sequences of cultivars 93-11 (indica) and Nipponbare (japonica) and 32,127 full-length cDNA sequences of Nipponbare obtained from public databases. We identified 13,308 putative ILPs. Based on these putative ILPs, we developed 5,811 candidate ILP markers via e-PCR with primers designed in the flanking exons. We further conducted experiments to verify the candidate ILP markers.

(4) Conserved non-coding elements (CNEs). We identified 436 CNEs in plants by alignments of three species (Arabidopsis, rice and Populus). By searching all CNEs against each other to identify paralogous CNEs (pCNEs) in Arabidopsis, we found 7,972 pCNEs. We assume that the functional specificity of proteins associated with CNEs is conserved among orthologs and paralogs. The results indicate that most of the enriched genes flanking CNEs are associated with transcription factors. The enriched transcription factors mainly comprise myb family transcription factors and zinc finger proteins.

(5) Transcription factor binding sites (TFBSs). A total of 787 co-expressed genes were identified by Pearson correlation and calibrated against the GO database. Using a Pearson correlation threshold of r > 0.95, we identified 279 genes in the root, 172 in the flower, 129 in the pollen and 207 in the seed. For each target gene, we identified TFBSs in the proximal promoter using stepwise regression. We systematically identified the individual TFBSs and TFBS combinations that control gene transcription and expression. The number of TFBS types differs among the tissues, and we also observe a large difference in the total number of TFBSs among the tissues.

(6) DNA methylation. In addition to TFBSs, gene expression is also regulated by DNA methylation. DNA methylation is involved in various biological processes, including tissue-specific gene expression and genomic imprinting. We present a computational prediction of DNA methylation in Arabidopsis. We used several different discriminant methods to classify methylated and non-methylated regions. The results show that the LMT classifier has a prediction accuracy of 71.03% on experimentally verified methylation data from Arabidopsis.
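The co-expression criterion in (5), a Pearson correlation of r > 0.95 between expression profiles, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how profiles were compared or grouped, so the pairwise comparison, the function names, the gene labels and the expression values below are all hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A flat profile has no defined correlation; treat it as uncorrelated.
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpressed_pairs(
    profiles: Dict[str, List[float]], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Return gene pairs whose profiles correlate above the threshold."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        r = pearson(p1, p2)
        if r > threshold:
            pairs.append((g1, g2, r))
    return pairs


if __name__ == "__main__":
    # Hypothetical expression values across four samples.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    for g1, g2, r in coexpressed_pairs(profiles):
        print(g1, g2, round(r, 3))
```

With thousands of genes, the all-pairs loop becomes slow; a correlation matrix computed with a numerical library would be the usual replacement.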

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication issue】 2007, Issue 03