High-throughput SNP Analysis in Oilseed Rape (Brassica napus L.) and Genome-Scale DNA Methylation Profiling in Brassica rapa Based on Reduced Representation Sequencing

【Author】 Chen Xun (陈勋)

【Supervisor】 Liu Kede (刘克德)

【Author information】 Huazhong Agricultural University, Crop Genetics and Breeding, 2014, PhD

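【Abstract, translated from the Chinese】 The genus Brassica includes many economically important crops, such as B. rapa, B. oleracea and B. napus, and is among the closest relatives of the model plant Arabidopsis thaliana. Most Brassica species are polyploid; even the diploids B. rapa and B. oleracea are ancient triploids in which many genes are present in three or more copies. B. napus is an allotetraploid crop formed by natural hybridization between B. rapa and B. oleracea. The genome sequence of B. napus has not yet been released, so large-scale SNP analysis that relies on a reference genome is not yet possible. In addition, the homoeologous sequences that are widespread in the genome hinder genomic and epigenomic research in Brassica crops. In this study, using double-digest reduced representation libraries and high-throughput sequencing, we performed reduced-representation sequencing of a DH segregating population of B. napus and developed companion SNP analysis software, RFAPtools, to identify alleles among complex homoeologous sequences and to construct high-density genetic linkage maps. We also combined the double-digest reduced representation library with bisulfite sequencing to develop a double-digest RRBS technique, which we used to profile DNA methylation at the genome level in B. rapa.

1. Construction of a high-density genetic map of B. napus. Genetic maps are an essential tool for genomics research, and polymorphic markers are the foundation of genetic maps. The abundance of homoeologous sequences and the lack of a genome sequence make it difficult to develop and map polymorphic markers such as SNPs in polyploid crops like B. napus. To solve this problem, we designed a method for constructing reduced representation libraries and developed companion bioinformatics software, RFAPtools. The software has three main parts: 1) construction of a pseudo-reference sequence; 2) SNP detection; 3) discrimination of allelic SNP variation from homoeologous sequences. Through in silico digestion, we analyzed the chromosomal distribution of the enriched restriction fragments, their size distribution, and the optimal amount of data needed per individual, demonstrating the feasibility of the reduced-representation sequencing technique. RFAPtools first separates part of the homoeologous sequences by constructing the pseudo-reference sequence; the prf_allele.sh script then uses population data to distinguish, among homoeologous sequences, allelic SNPs belonging to the same locus. The technique is therefore applicable to high-throughput SNP analysis in any species, particularly species such as B. napus and wheat whose genomes are complex and not yet fully sequenced. We performed reduced-representation sequencing of the two parents and the BnaNZDH population, and used RFAPtools to call SNPs and genotype the population. Two parallel high-density genetic linkage maps were constructed: one containing 8780 SNP loci and one PAV map containing 12423 dominant loci. Collinearity analysis between the sequences of A-subgenome loci on these two maps and the B. rapa genome detected 14 possible assembly errors and 8 possibly misplaced scaffolds, allowing the B. rapa genome sequence to be corrected. Alignment against unanchored B. rapa scaffolds placed 44 unanchored scaffolds (8.15 Mb in total) on different B. rapa chromosomes. To verify the accuracy and reproducibility of the method, we randomly selected 44 SNP loci for Sanger sequencing and converted them to CAPS markers to test polymorphism between the parents. Of these, 26 loci were confirmed; for the 18 unconfirmed SNP loci, the PCR products all contained multiple homoeologous sequences or lacked the target sequence. Using the 26 confirmed SNP loci to genotype 91 DH lines, 2251 genotypes were obtained in total, with an accuracy of 99.3%. Six of the DH lines were used for replicate library construction and sequencing: the reproducibility of SNP loci exceeded 99%, while PAV reproducibility depended on data volume and also reached above 98% when both replicates had more than 1.5 million reads.

2. Profiling genome-wide DNA methylation in B. rapa. DNA methylation plays a regulatory role in processes such as gene expression and transposon silencing and is one of the most important epigenetic modifications. In recent years the DNA methylomes of many plants have been analyzed with various high-throughput technologies. We therefore modified our previously developed reduced representation library method to create a double-digest RRBS technique and used it to study genome-wide DNA methylation in B. rapa. Comparative analysis showed that, in the chromosomal regions enriched by double-digest RRBS, the proportions of the three sequence contexts falling in gene and transposon regions matched the corresponding proportions at the whole-genome level in B. rapa. In silico digestion of the rice genome, compared with the whole genome, gave consistent results, demonstrating that double-digest RRBS can be used to profile genome-wide DNA methylation. Using this method, we analyzed genome-wide DNA methylation levels at CG and non-CG sites in B. rapa: 52.4% for CG, 31.8% for CHG and 8.3% for CHH. The vast majority of CG sites were either unmethylated or highly methylated, whereas 51.8% of CHG and 77.4% of CHH sites were lowly methylated. We also analyzed the distribution of DNA methylation along the B. rapa chromosomes and found that it follows the distribution of repetitive sequences such as transposons and is opposite to that of genes. Except for the extant centromere region of chromosome A02, the vast majority of extant centromere and paleocentromere regions remained highly methylated. DNA methylation levels differed greatly between gene and transposon regions, and their patterns resembled those in Arabidopsis: methylation was lowest around transcription start and termination sites, gene regions were clearly lower than flanking sequences, and transposon regions maintained a relatively constant, high level of methylation. Analysis of gene-region DNA methylation across subgenomes showed LF < MF2 < MF1, although the differences were not pronounced, and this result agreed with the differences in gene expression level. Analysis of DNA methylation among genes with different copy numbers showed that single-copy genes had clearly higher methylation than multi-copy genes, with the largest difference near transcription start and termination sites. We therefore propose that genes with higher DNA methylation are more easily lost, whereas genes with lower DNA methylation are more easily retained. Single-copy genes in the LF subgenome had significantly lower DNA methylation than those in the other two subgenomes, whereas multi-copy genes showed no significant difference. We therefore propose that differences in DNA methylation of single-copy genes account for the methylation differences among subgenomes and determine that the proportion of genes lost in LF is significantly lower than in the other two subgenomes. From an epigenetic perspective, this explains a possible molecular mechanism of gene loss and the difference in gene loss rates among the three subgenomes of B. rapa.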
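
The abstract states that RFAPtools uses population data to separate allelic SNPs from homoeologous sequence variation but does not give the rule it applies. The Python sketch below illustrates one plausible criterion and should not be read as the actual logic of prf_allele.sh. It assumes that, in a DH population, a true allelic SNP leaves each line carrying only one parental base with the two classes segregating roughly 1:1, whereas a homoeologous variant shows both bases in most lines. The function name, the genotype codes ('A', 'B', 'H') and both thresholds are hypothetical.

```python
from collections import Counter


def classify_site(calls, max_het_frac=0.1, chi2_cutoff=3.84):
    """Classify one candidate SNP site from per-line calls in a DH population.

    calls: list with one entry per DH line:
        'A'  only the parent-1 base was observed
        'B'  only the parent-2 base was observed
        'H'  both bases were observed in the same line
        None no data for this line
    max_het_frac and chi2_cutoff are illustrative assumptions;
    3.84 is the 5% critical value of chi-square with 1 degree of freedom.
    """
    counts = Counter(c for c in calls if c is not None)
    n_a, n_b, n_h = counts["A"], counts["B"], counts["H"]
    total = n_a + n_b + n_h
    if total == 0:
        return "no_data"

    # DH lines are expected to be homozygous, so frequent 'H' calls suggest
    # that reads from two homoeologous copies were collapsed together.
    if n_h / total > max_het_frac:
        return "homoeologous"

    # Chi-square goodness-of-fit test against the 1:1 ratio expected in a DH
    # population (n > 0 here because 'H' calls are a minority).
    n = n_a + n_b
    expected = n / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    return "allelic" if chi2 <= chi2_cutoff else "distorted"


if __name__ == "__main__":
    print(classify_site(["A"] * 45 + ["B"] * 46))
    print(classify_site(["H"] * 80 + ["A"] * 11))
```

A real pipeline would also need to handle sequencing errors, missing data and genuine segregation distortion, which this sketch only flags as "distorted".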

【Abstract】 The cultivated Brassica species include many economically important crops, such as Brassica rapa, Brassica oleracea and Brassica napus, and are among the closest relatives of Arabidopsis thaliana. Most Brassica species are polyploid: the diploids B. rapa and B. oleracea are considered ancient triploids in which many genes are present in three copies, and the allotetraploid B. napus arose from natural hybridization between B. rapa and B. oleracea. Without a reference genome sequence, high-throughput analysis of SNP variation in B. napus is not possible. In addition, the presence of homoeologous sequences hinders genomic and epigenomic studies in Brassica. Using double-digestion reduced representation libraries and next-generation sequencing, we sequenced an oilseed rape DH population, designed the RFAPtools software to discriminate allelic SNPs from homoeologous sequences, and constructed two high-density genetic maps. By combining this approach with bisulfite treatment, we developed a modified RRBS technology for genome-scale DNA methylation profiling in B. rapa.

1. Construction of a high-density genetic map in B. napus. Genetic maps have become essential tools for a wide range of genetic and genomic studies and depend largely on polymorphic molecular markers. The presence of homoeologous sequences and the absence of a reference genome sequence make the discovery and genotyping of single nucleotide polymorphisms (SNPs) more challenging in allotetraploid B. napus. To address this challenge, we developed a reduced representation library construction technology and designed a bioinformatics software package called RFAPtools.
RFAPtools consists of three modules: 1) assembly of a pseudo-reference sequence, 2) SNP identification and 3) discrimination of allelic SNPs from homoeologous sequence variations. Through in silico enzyme digestion, we analyzed the distribution of fragments across chromosomes, the fragment lengths and the amount of sequence data suitable for each individual. By constructing the pseudo-reference sequence, RFAPtools separates most homoeologous sequences; in addition, the prf_allele.sh script uses population sequence data to discriminate allelic SNPs from homoeologous sequences. Hence this methodology is suitable for SNP analysis in all species, especially species with complex genome structures and no genome sequence. A common set of restriction fragments was sequenced across a doubled haploid (DH) population (BnaNZDH) of the well-established allotetraploid Brassica napus and its two parents. Allelic SNPs and presence/absence variations (PAVs) were identified using RFAPtools. Two parallel linkage maps were constructed: one SNP bin map containing 8780 SNP loci and one PAV linkage map containing 12,423 dominant loci. By aligning these linkage maps to the B. rapa reference genome sequence, we assigned 44 unassembled sequence scaffolds comprising 8.15 Mb to B. rapa chromosomes and identified 14 instances of possible misassembly and eight instances of possibly mis-ordered sequence scaffolds. To assess the authenticity of the identified SNPs, we randomly selected 44 SNPs for direct Sanger sequencing and converted them to CAPS markers to detect polymorphism between the parents. Of these, 26 were confirmed; the PCR products of the other 18 SNP loci contained homoeologous sequences or did not yield the target sequences. We also surveyed the 91 DH lines to validate the SNP genotypes using the 26 confirmed SNPs. A total of 2251 genotypes were generated, with an accuracy of 99.33%. Furthermore, we sequenced 6 DH lines in duplicate with different numbers of reads.
The consistency of SNP genotypes between the two replicates was higher than 99.88%, whereas the consistency of PAV genotypes was sensitive to the amount of sequence data, exceeding 98% when more than 1.50 million reads were obtained.

2. Genome-scale DNA methylation analysis in B. rapa. DNA methylation is one of the most important epigenetic modifications and influences gene transcription and transposon silencing. Recently, the epigenomes of many important plant species have been dissected using diverse high-throughput technologies. Here we modified our previously designed reduced representation library methodology to develop a modified RRBS technology (mRRBS) and applied it to dissect genome-scale DNA methylation in B. rapa. We compared the sequences enriched by mRRBS with the whole genome sequence by calculating the percentage of the three contexts (CG, CHG and CHH) located in gene and transposon regions. The results were consistent, as were those from an in silico double-digestion study in rice, confirming that mRRBS can be used to dissect whole-genome DNA methylation. Using mRRBS, we calculated whole-genome methylation levels at CG and non-CG sites and observed overall genome-wide levels of 52.4% CG, 31.8% CHG and 8.3% CHH methylation. Most CGs were either unmethylated or highly methylated, whereas 51.8% of CHG and 77.4% of CHH sites were hypomethylated. We studied the chromosomal distribution of the average methylation level of the three contexts and found that it correlates positively with repeat content and negatively with gene content. Except for lower DNA methylation in the pericentromeric region of chromosome A02, extensive DNA methylation was detected around extant and ancient centromere regions. DNA methylation differed between gene and transposon regions, and the distributions in these regions were similar to those in Arabidopsis: methylation was lowest around the transcription start site and transcription termination region, and lower in the gene body than in upstream or downstream regions.
We also found stable, extensive DNA methylation along transposon regions. We profiled DNA methylation in gene regions belonging to the three paleogenomes, which gave LF < MF2 < MF1 without significant difference, and this result was consistent with the gene expression levels. We also characterized DNA methylation in different components of single-copy and duplicated genes and found higher methylation in single-copy than in duplicated genes, especially around the transcription start site and transcription termination region. Hence we consider that hypermethylated genes are prone to be lost, whereas hypomethylated genes tend to be retained. Single-copy genes in LF showed a lower methylation level than those in the other two subgenomes, but no consistent and significant difference was detected for duplicated genes among the three subgenomes. We consider that the differential DNA methylation among the three subgenomes is due to differential DNA methylation in single-copy genes and results in the lowest gene loss ratio in LF compared with the other two subgenomes. Based on these B. rapa epigenomic studies, we uncovered a possible molecular mechanism controlling gene loss and differential gene loss among the three subgenomes of B. rapa.
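
The methylation levels reported above are given separately for the CG, CHG and CHH contexts, where H stands for A, C or T. As a minimal sketch of how such per-context levels can be derived from bisulfite read counts, the Python code below classifies forward-strand cytosines by context and computes a read-weighted methylation level for each. The function names, the input layout and the choice of a read-weighted average are assumptions for illustration; the abstract does not state which estimator was used.

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG' or 'CHH' for a forward-strand cytosine at seq[i].

    Returns None if seq[i] is not 'C' or if the context cannot be
    determined (sequence end or an ambiguous base 'N').
    """
    if seq[i] != "C":
        return None
    nxt = seq[i + 1:i + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) < 2 or "N" in nxt:
        return None
    return "CHG" if nxt[1] == "G" else "CHH"


def context_methylation_levels(seq, counts):
    """Compute read-weighted methylation levels per context.

    seq:    reference sequence (upper-case string)
    counts: dict mapping 0-based position -> (methylated_reads, total_reads)
            for forward-strand cytosines (hypothetical input layout)
    Returns a dict context -> level in [0, 1], or None if no reads.
    """
    totals = {"CG": [0, 0], "CHG": [0, 0], "CHH": [0, 0]}
    for pos, (meth, depth) in counts.items():
        ctx = cytosine_context(seq, pos)
        if ctx is None or depth == 0:
            continue
        totals[ctx][0] += meth
        totals[ctx][1] += depth
    return {ctx: (m / d if d else None) for ctx, (m, d) in totals.items()}


if __name__ == "__main__":
    reference = "ACGTCAGTCTTC"
    read_counts = {1: (8, 10), 4: (3, 10), 8: (1, 10), 11: (5, 5)}
    print(context_methylation_levels(reference, read_counts))
```

The sketch covers only the forward strand; a full analysis would also classify cytosines on the reverse strand (guanines in the reference) and apply coverage filters.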

  • 【Classification No.】 S565.4; S634.3
  • 【Times cited】 2
  • 【Downloads】 2035