
中国梅花鹿全基因组初步组装、分析及单核苷酸多态性研究

Preliminary Assembly and Analysis of Chinese Sika Deer Genome and Single Nucleotide Polymorphisms Studies

【Author】 Ba Hengxing (巴恒星)

【Supervisor】 Li Chunyi (李春义)

【Author information】 Chinese Academy of Agricultural Sciences, Special Economic Animal Husbandry, 2012, PhD

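【Abstract (Chinese version, translated)】 The emergence and rapid development of next-generation high-throughput sequencing has made a detailed, comprehensive analysis of a species' genome possible. In this study, the whole genome of one Chinese sika deer (northeastern subspecies, Cervus nippon hortulorum) was sequenced on the SOLiD (Applied Biosystems) platform using a whole-genome shotgun strategy. The northeastern subspecies is one of the most important deer in the Chinese deer industry. After a preliminary quality assessment of the raw sequencing data, about 1.9×10^9 paired-end reads of 50 bp were obtained, an estimated sequencing depth of about 32×, from two libraries with insert sizes of 1 kbp and 2 kbp.

The assembly combined the currently available strategies, namely global de novo assembly, reference-guided local assembly, and synteny-based local assembly that exploits sequences conserved between the deer and cattle genomes, in order to assemble as much of the sika deer genome as possible. This produced about 4 million contigs longer than 100 bp, totalling 1.83 Gbp, with an N50 of 695 bp and a maximum contig length of 10.80 kbp. Quality assessment found assembly errors in about 0.3% of the contigs, and these were corrected. Finally, paired-end information was used to order and orient the contigs, giving about 1.9 million scaffolds with an N50 of 21.6 kbp, a maximum length of 249 kbp, a total of 2.6 Gbp, and about 1.8 million gaps.

The assembled sika deer genome was then analysed further with a range of bioinformatics methods. The main results and conclusions are as follows:

1. Sequencing bias caused coverage to differ widely between regions of the sika deer genome: the higher the GC content of a region, the lower its coverage. The sika deer genome covers about 62% of the cattle genome (at more than 85% similarity) and about 62% of the deer transcriptome (at more than 90% similarity).

2. From microsatellite primer datasets for deer, cattle and sheep, 1,534 primers conserved in the sika deer genome were selected, 61% of all the microsatellites collected.

3. Between the deer and cattle genomes, SNP and indel variation is more conserved in functional regions than in non-functional regions, and SNP variation and indel variation are strongly positively correlated at the genome level. A comparison of deer-cattle with human-chimpanzee genomic SNP variation supports the molecular clock theory.

4. In the assembled sika deer genome, 2.7 million heterozygous SNP sites were identified, an average of one SNP per 678 bp. The SNP heterozygosity rates of the autosomal genome, exons and coding regions of this individual were 0.152%, 0.087% and 0.082%, respectively. This high heterozygosity suggests that, during long-term captive rearing and domestication, Chinese sika deer of different genetic backgrounds interbred.

5. The 6,367 SNP sites produced by this study are well suited to a custom medium-density SNP genotyping chip that could be used for genotyping across deer species, including sika deer, wapiti and red deer. This demonstrates that developing a high-density whole-genome SNP genotyping chip for deer is feasible.

6. The assembly of the sika deer mitochondrial genome further confirmed that Numt sequences are widespread in vertebrate and invertebrate nuclear genomes. It also suggests that the sika deer nuclear genome contains a large number of Numt sequences, in an amount equivalent to 1,867 mitochondrial genomes.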
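The contig and scaffold N50 values quoted above follow the usual assembly-statistics convention: N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation follows. The function name `n50` and the toy lengths are illustrative assumptions, not data or code from the thesis.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy contig lengths (not from the thesis).
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))  # 400: 500 + 400 = 900 >= 1500 / 2
```

N50 is weighted by bases, not by sequence count, so a small N50 such as 695 bp indicates that much of the assembled sequence sits in short contigs.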

【Abstract】 The advent and rapid development of next-generation sequencing technology have made it possible to analyse the whole genome of any species comprehensively. In this study, we sequenced the whole genome of an adult male Chinese sika deer (Cervus nippon hortulorum), one of the most important species for the Chinese deer industry, using a whole-genome shotgun approach on the SOLiD (Applied Biosystems) sequencing platform.

After a preliminary analysis of the raw sequencing data, we obtained 1.92 billion 50 bp paired-end reads, equivalent to 32× coverage of the sika deer genome. Two separate libraries were made, one with 1 kbp inserts and the other with 2 kbp inserts. To assemble the genome we combined a variety of currently available techniques: global de novo assembly, reference-guided local assembly, and conserved-synteny local assembly, the last of which exploits the conserved synteny between the sika deer and cattle genomes. We generated 4.15 million contigs of 100 bp or longer, comprising 1.83 billion base pairs of assembled sequence. The contig N50 is 695 bp and the maximum contig length is 10.80 kbp. A quality check found that about 0.3% of the contigs were misassembled, and these were corrected before the subsequent analyses. We then used the paired-end information to generate 1.94 million scaffolds with an N50 of 21.6 kbp and a maximum length of 249 kbp. The assembled genome has 1.8 million gaps.

Next, we analysed the assembled sika deer genome with a variety of bioinformatics approaches and tools. The main results and conclusions are as follows.

1. Sequencing bias led to markedly different coverage across regions of the assembled sika deer genome: the higher the GC content of a region, the lower its coverage. The assembled sika deer genome covers about 62% of the reference cattle genome (at identity above 85%) and about 62% of the deer transcriptome sequence (at identity above 90%).

2. From the collected microsatellite primer datasets for deer, cattle and sheep, 1,534 primers, 61% of the total, were selected as candidates because they are conserved in the sika deer genome.

3. Variation between the deer and cattle genomes is more conserved in functional regions than in non-functional regions. SNP variation and indel variation are strongly positively correlated across genome regions. Comparing SNPs between human and chimpanzee with those between cattle and deer again supports the molecular clock theory.

4. In the sequenced and assembled genome, 2.7 million SNPs were detected, equivalent to one SNP every 678 bp. The SNP heterozygosity rates were 0.152%, 0.087% and 0.082% in the autosomal genome, exons and coding regions, respectively. This high heterozygosity implies that sika deer of different genetic backgrounds exchanged genes during long-term domestication and artificial breeding.

5. This study produced 6,367 SNP sites that are suitable for developing a medium-density SNP genotyping chip. Such a chip could be applied to different deer species, including sika deer, red deer and wapiti. This also demonstrates the feasibility of developing a high-density whole-genome SNP genotyping chip.

6. The assembly of the sika deer mitochondrial genome further confirmed that Numt sequences are widespread in vertebrate and invertebrate nuclear genomes. It also implies that the sika deer nuclear genome contains numerous Numt sequences, in an amount equivalent to about 1,867 mitochondrial genomes.
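The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below rests on two assumptions that the abstract does not state: that each of the 1.92 billion reads contributes 50 bp, and that the SNP density is computed over the 1.83 Gbp of assembled contig sequence. The function names are hypothetical.

```python
def implied_genome_size_bp(read_count, read_length_bp, depth):
    """Genome size implied by depth = (reads * read length) / genome size."""
    return read_count * read_length_bp / depth


def bases_per_snp(assembled_bp, snp_count):
    """Average spacing between SNPs over the assembled sequence."""
    return assembled_bp / snp_count


# Figures quoted in the abstract; the pairing of numerator and
# denominator for the SNP density is an assumption.
genome_bp = implied_genome_size_bp(1.92e9, 50, 32)
spacing = bases_per_snp(1.83e9, 2.7e6)

print(f"implied genome size: {genome_bp / 1e9:.2f} Gbp")  # 3.00 Gbp
print(f"bases per SNP: {spacing:.0f}")                    # 678
```

Under these assumptions the reported 32× depth corresponds to a genome of roughly 3 Gbp, and 2.7 million SNPs over 1.83 Gbp reproduces the stated density of one SNP per 678 bp.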
