Theoretical Analysis and Prediction of Nucleosome Positioning Based on Sequence Information

【Author】 Xing Yongqiang (邢永强)

【Advisor】 Cai Lu (蔡禄)

【Author information】 Inner Mongolia University, Biophysics, 2014, PhD

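【Summary】 Epigenetics is a leading field of the post-genomic era, and nucleosome positioning is one of its major research areas. As the basic unit of higher-order chromatin structure in eukaryotes, the nucleosome not only compacts chromatin but also plays an important role in regulating gene expression. By blocking the access of protein factors to DNA, nucleosomes regulate transcription, replication, recombination, repair, splicing, and disease development. Studying nucleosome positioning in eukaryotes can further clarify how higher-order chromatin structure forms and help reveal the complex processes of gene expression regulation. Nucleosome positions in eukaryotic genomes are jointly determined by DNA sequence, chromatin remodelers, the transcriptional machinery, histone modifications, histone variants, and other factors. To date, DNA sequence remains the single factor with the greatest influence on nucleosome positioning. Many theoretical and experimental studies have examined nucleosome positioning from the sequence perspective. However, most existing theoretical work focuses on the core DNA wrapped around the histone octamer and pays little attention to the linker DNA between nucleosome core particles. Using high-throughput sequencing data on nucleosome positioning, this thesis analyzes in detail the differences in sequence features between core DNA and linker DNA and, with DNA sequence features as input parameters, develops a position-correlation scoring function (PCSF) and a support vector machine (SVM) to predict nucleosome positioning in yeast and other eukaryotic genomes. The main results are as follows:

1. The k-mer features (k=1,2,...6) and the sequence bias features Mk(i) (k=1,2,...6) of core DNA and linker DNA in the yeast genome were analyzed statistically. The A+T content of oligonucleotides is lower in core DNA than in linker DNA. The higher the A+T content, the more rigid the sequence and the less readily the DNA bends. The low content of oligonucleotides composed of A and T in core DNA therefore helps the core DNA wrap around the histone octamer. The k-mer bias values Mk(i) (k=1,2,...6) are higher in linker DNA than in core DNA, so the sequence bias, or equivalently the sequence conservation, of linker DNA is stronger than that of core DNA. This finding provides an important clue for predicting nucleosome positioning from the sequence features of linker DNA.
2. The information redundancy parameter Dk describes the vocabulary composition and grammatical structure of a DNA sequence. Dk values calculated for nucleosome positioning sequences from the yeast, fruit fly, and nematode genomes confirm that Dk differs significantly between core DNA and linker DNA, and that both core and linker sequences are dominated by short-range correlation. This dominance of short-range correlation also explains why most theoretical models that predict nucleosome positioning from oligonucleotide or k-mer information perform well. We also confirmed that the difference in information content between core DNA and linker DNA and the dominance of short-range correlation are universal: they depend neither on the source of the experimental data nor on the length of the linker DNA.
3. Power spectrum analysis is an important means of identifying periodic signals in DNA sequences. Power spectrum analysis of core and linker DNA sequences from the yeast, fruit fly, and nematode genomes shows that core DNA in all three model organisms has clear 3-nt and 10-nt periodicities, and that these periodic signals are stronger than in the corresponding linker DNA. Species specificity of the power spectra was also observed.
4. To further clarify the effect of base correlation on nucleosome positioning, the distribution of the parameter Fk, which describes the correlation of each of the 16 dinucleotides, was calculated for core DNA and linker DNA. An SVM that takes as input the 1,584-dimensional (99×16) vector of Fk values (k=0,1,2,...98) of a nucleosome positioning sequence distinguishes core DNA from linker DNA reasonably well in H. sapiens, O. latipes, C. elegans, C. albicans, and S. cerevisiae, with an average total accuracy (TA) of 76.05% and a Matthews correlation coefficient (MCC) of 0.4876.
5. A PCSF algorithm for predicting nucleosome positioning in the yeast genome was built on the 4-mer bias feature M4(i) of linker DNA. The algorithm distinguishes core DNA from linker DNA well: in five-fold cross-validation the mean sensitivity and specificity reach 94.42% and 94.35%, respectively. The PCSF algorithm was also used to predict nucleosome occupancy across the whole yeast genome; the Pearson correlation coefficient between the predicted occupancy and the in vitro nucleosome positioning map measured by Kaplan is 0.761. The predicted nucleosome occupancy in the neighborhood of specific genes also agrees well with experimental results. Occupancy profiles predicted by PCSF around three types of functional sites (transcription start sites, transcription termination sites, and replication origins) identify the key nucleosome-depleted regions. The PCSF algorithm can serve as an effective tool for the theoretical prediction of nucleosome positioning.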
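
The k-mer and A+T statistics in result 1 can be illustrated with a minimal Python sketch. It computes only plain overlapping k-mer frequencies and A+T content; the bias feature Mk(i) is not defined in this record, so it is not implemented here. The function names and the two toy sequences are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer over the A/C/G/T alphabet."""
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (for example N) are ignored in the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def at_content(seq):
    """Fraction of A and T among the unambiguous bases of seq."""
    seq = seq.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        return 0.0
    return (seq.count("A") + seq.count("T")) / n_acgt


# Hypothetical toy sequences, chosen only to make the contrast visible.
core_like = "GCGCGGCCATGCGCGGCCTAGCGC"
linker_like = "AAAATTTTATATAAAATTTT"

print(at_content(core_like), at_content(linker_like))
print(kmer_frequencies(linker_like, 2)["AA"])
```

On real data the same two functions would be applied to sets of core and linker sequences and the resulting distributions compared.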

【Abstract】 Epigenetics is an important frontier and a research hotspot of the post-genomic era, and nucleosome positioning is one of its major areas. As the basic building block of higher-order chromatin structure, the nucleosome not only packages genomic DNA but is also involved in the regulation of gene expression. By controlling DNA accessibility, nucleosomes regulate various biological processes, such as transcription, DNA replication, recombination, DNA repair, mRNA splicing, and disease development. Investigating nucleosome positioning across eukaryotic genomes can help elucidate how higher-order chromatin structure forms and uncover the intricate regulation of gene expression.

Nucleosome positioning along the genome is determined by many factors, including DNA sequence preferences, chromatin remodeling complexes, the transcriptional machinery, histone post-translational modifications, and histone variants. Of the factors recognized so far, the intrinsic DNA sequence preference of the nucleosome has been shown to be the most important. Many theoretical and experimental studies of nucleosome positioning based on DNA sequence exist. However, most theoretical studies have focused on the core DNA wrapped around the histone octamer and have paid little attention to the linker DNA between nucleosomes. In this work, the sequence characteristics of core DNA and linker DNA, retrieved from high-resolution nucleosome position data, were analyzed statistically. Based on DNA sequence signals, a novel position-correlation scoring function (PCSF) and a support vector machine (SVM) were developed to predict nucleosome positioning in eukaryotic genomes.

First, the distributions of k-mers (k=1,2,...6) and of the sequence bias parameter Mk(i) (k=1,2,...6) were analyzed systematically in core and linker sequences across the S. cerevisiae genome.
Oligonucleotides composed of A and T are more enriched in linker DNA than in core DNA. The higher the A+T content, the more rigid the sequence and the less readily it bends. Thus, the lower content of A/T oligonucleotides in core DNA favors the wrapping of DNA around the histone octamer. The k-mer frequency bias Mk(i) (k=1,2,...6) is markedly higher in linker regions than in core regions. In other words, sequence bias, or sequence conservation, is stronger in linker regions than in nucleosome core regions. This result provides an important clue for predicting nucleosome occupancy from the sequence bias parameter of linker DNA.

Second, the information redundancy Dk describes the vocabulary composition and grammatical structure of the genetic language. The Dk values calculated for the genomes of S. cerevisiae, D. melanogaster, and C. elegans indicate that Dk in core DNA differs significantly from that in linker DNA, and confirm that short-range nucleotide correlation is dominant in both core and linker DNA sequences. This dominance of short-range correlation probably explains why most theoretical models based on oligonucleotide or k-mer frequencies predict nucleosome positioning with high accuracy. We also confirmed that the difference in information content between core DNA and linker DNA is universal: it is altered neither by the difference in sequence length nor by the method used to construct the core and linker datasets.

Third, power spectrum analysis is a widely used method for detecting periodicity in DNA sequences. Power spectra of core DNA and linker DNA from the S. cerevisiae, D. melanogaster, and C. elegans genomes show that 3-nt and 10-nt periodicities are evident in core DNA and are stronger than in linker DNA for all three model organisms.
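
The periodicity analysis described above can be sketched with a generic Fourier approach: each base is turned into a 0/1 indicator sequence, and the squared magnitude of its discrete Fourier transform is evaluated at the frequency 1/period. This is a common textbook formulation and an assumption here; the abstract does not state which spectrum definition or normalization the thesis uses. The function name and the toy sequence are hypothetical.

```python
import cmath

BASES = "ACGT"


def power_at_period(seq, period):
    """Summed power of the four base-indicator sequences at frequency 1/period.

    The result is divided by the sequence length so that sequences of
    different lengths are roughly comparable (an illustrative choice).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in BASES:
        # Fourier sum over the positions where this base occurs.
        s = sum(
            cmath.exp(-2j * cmath.pi * pos / period)
            for pos, ch in enumerate(seq)
            if ch == base
        )
        total += abs(s) ** 2
    return total / n


# Toy sequence with an exact 3-nt repeat (hypothetical, not real nucleosome data).
toy = "ACG" * 30
print(power_at_period(toy, 3), power_at_period(toy, 10))
```

For real core and linker sets, the value at periods 3 and 10 would typically be averaged over many sequences before the two classes are compared.
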
In addition, the power spectra showed species-specific features.

Fourth, to further clarify the effect of nucleotide correlation on nucleosome positioning, the parameter Fk (k=0...98), which describes the correlation of each of the 16 dinucleotides, was calculated for core DNA and linker DNA. Using the resulting 1,584-element (99×16) vectors as input, an SVM was used to classify core and linker DNA regions in Homo sapiens, Oryzias latipes, C. elegans, Candida albicans, and S. cerevisiae. The model performed well, with an average total accuracy of 76.05% and an average MCC of 0.4876 across the five organisms.

Finally, a novel PCSF algorithm based on the 4-mer frequency bias M4(i) in linker sequences was developed to distinguish core from linker sequences. In 5-fold cross-validation the algorithm achieved a mean sensitivity of 94.42% and a mean specificity of 94.35%. The algorithm was then used to predict nucleosome occupancy throughout the S. cerevisiae genome, giving a Pearson correlation coefficient of 0.761 with the in vitro experimental nucleosome positioning map of the 16 chromosomes. The predicted nucleosome profiles around specific genes are also notably similar to experimental maps of nucleosome organization in vitro and in vivo. The nucleosome occupancy profiles predicted by PCSF in the vicinity of TSS, TTS, and ACS reveal pronounced nucleosome-depleted regions. These results suggest that intrinsic DNA sequence preferences in linker regions have a significant impact on nucleosome occupancy and that the PCSF algorithm is an effective tool for predicting nucleosome positioning.
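
The evaluation measures quoted in the abstract (sensitivity, specificity, total accuracy, MCC, and the Pearson correlation coefficient) can be computed as in the sketch below. It uses the standard textbook definitions, which is an assumption, since the record does not spell out the formulas used in the thesis. The function names and the example counts are hypothetical and are not the thesis data.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, total accuracy and MCC from confusion counts.

    Assumes at least one positive and one negative example.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ta = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "TA": ta, "MCC": mcc}


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts and occupancy profiles, for illustration only.
print(classification_metrics(tp=90, fn=10, tn=80, fp=20))
print(pearson([0.1, 0.4, 0.8, 0.3], [0.2, 0.5, 0.7, 0.2]))
```

In a 5-fold cross-validation setting, these measures would be computed per fold and then averaged.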

  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication year and issue】 2014, Issue 09