
复杂性状遗传分析方法研究及其软件开发

Developing Methods and Software for Genetic Analysis of Complex Traits

【Author】 Yang Jian (杨剑)

【Advisor】 Zhu Jun (朱军)

【Author information】 Zhejiang University, Crop Genetics and Breeding, 2008, PhD

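【Abstract (translated from the Chinese)】 Elucidating the genetic mechanisms of complex traits is an important foundation for the genetic improvement of plants and animals and for research into the pathogenesis of complex human diseases. The essence of genetic research on complex traits is to map the QTLs (quantitative trait loci) that control their inheritance, to detect interactions between QTLs (epistasis), to detect how QTL and epistatic effects differ across environmental conditions, and then to identify the corresponding candidate genes and their regulatory networks. Based on mixed linear model methods, this study proposes a full-QTL model and a two-step mapping strategy applicable to a range of experimental populations, dissecting the polygenic system of a complex trait into QTL main effects, epistatic effects between pairs of QTLs, and their interactions with environmental factors. Computer simulations and analyses of various rice traits verified the validity and reliability of the model and the mapping strategy. On this basis, a QTL mapping model that integrates host and pathogen genetic information is proposed for the first time; it can simultaneously detect QTLs controlling resistance and pathogenicity on the host and pathogen genomes, as well as QTL interactions between host and pathogen. In addition, the methods are applied to QTL analysis based on DArT marker technology, and a method is proposed for predicting optimal genotypes from the genetic information obtained by QTL mapping. A method for eQTL mapping that combines microarray expression data with molecular marker genotype data is also proposed. Finally, two user-friendly software packages were developed on the basis of these models and methods. The main contents and conclusions of the study are as follows:

1) For recombinant inbred (RI) and doubled haploid (DH) populations, a full-QTL model is proposed that includes the additive effects of multiple QTLs, interactions between QTLs (additive × additive epistasis), and their interactions with the environment, in order to explore the polygenic system of complex traits across multiple environments. A new mapping strategy is further proposed, comprising marker interval analysis, analysis of interactions between marker intervals, and genome scans, to locate multiple QTLs and the interactions between them in a genetic population. Hypotheses are tested with an F-statistic based on Henderson method III, permutation is used to control the genome-wide false positive rate, and a Bayesian method based on Gibbs sampling is used to estimate the genetic parameters of the full-QTL model. Monte Carlo simulation was used to test the reliability and validity of the method, and two real datasets (olfactory bulb weight in the mouse BXD population and yield in rice) were used to validate it.

2) For F2 and RIX (random intercrosses of recombinant inbred lines, also called IF2) populations, the full-QTL model is extended to include the additive and dominance effects of QTLs; additive × additive, additive × dominance, dominance × additive and dominance × dominance epistatic effects; and their interactions with the environment. Computer simulations examined the power and false discovery rate for detecting QTLs and epistasis under different RIX designs, and worked examples (rice yield and mouse brain weight) were used to validate the method. The results show that most QTLs are pleiotropic, whereas interactions between QTLs tend to differ between traits. The proportion of phenotypic variation due to environmental variation differs greatly between traits.

3) The development of array-based high-throughput genotyping methods, such as DArT (diversity arrays technology) and SNP (single nucleotide polymorphism) arrays, offers an important opportunity to expand genetic mapping populations on a large scale. This study proposes a QTL mapping strategy based on the DArT genotyping system. A low-density linkage map of SSR (simple sequence repeat) markers was constructed from a DH population, and a high-density linkage map combining DArT and SSR markers was constructed from a sub-population of that DH population. QTL mapping for barley net blotch was carried out both with the low-density map and the full population and with the high-density map and the sub-population; both analyses located a pair of interacting major-effect QTLs. The results indicate that a high-density linkage map, a small population and precise phenotyping can improve the precision of mapping major-effect QTLs. DArT markers can therefore be used to genotype large numbers of sub-populations and so improve the efficiency of QTL mapping experiments.

4) Based on the QTL effects obtained with these methods, the conventional breeding value, which includes only additive and dominance effects, is extended to include genetic main effects that are stable across environments and environment interaction effects that are expressed in specific environments. A stepwise genotype-tuning method is proposed to select optimal genotypes that combine favorable genotypes (the optimal pure line and the optimal hybrid), and so to predict the potential for genetic improvement of a population. Analysis of grain weight per plant in rice showed that the predicted optimal pure lines and optimal hybrids were clearly superior to the F1 generation of the two parents, and that this superiority was contributed largely by epistatic effects and QTL × environment interaction effects.

5) Under the genome-for-genome hypothesis, a genetic model that integrates host and pathogen genetic information is proposed to detect QTLs on the host and pathogen genomes that control host disease resistance, as well as QTL interactions between host and pathogen. With candidate markers used as background control, a one-dimensional genome scan detects main-effect QTLs on both genomes simultaneously, and a two-dimensional genome scan then detects epistasis within the host and pathogen genomes and QTL interactions between the two genomes. Permutation is used to control the experiment-wise false positive rate when detecting both main-effect and interacting QTLs, and Monte Carlo simulation verified the validity and reliability of the model and method. The simulation results show that the method estimates the genetic parameters of the model well and has sufficient statistical power to detect main-effect QTLs and interactions between QTLs.

6) A method for identifying differentially expressed genes is proposed. It is applicable to microarray experiments with one or two treatment factors and can also analyze unbalanced data. An F-statistic based on Henderson method III tests the difference in expression of each gene across treatment levels, and the P-value threshold is adjusted to control the experiment-wise false discovery rate. Expression profiles of human acute leukemia (38 clinically diagnosed patient samples) were analyzed. Compared with SAM (significance analysis of microarrays) and MAANOVA (microarray analysis of variance), the proposed method gives results very close to those of MAANOVA for data with a single treatment factor, but MAANOVA cannot handle missing data directly. Expression profiles from six brain regions of two inbred mouse strains (two treatment factors) were also analyzed; compared with the previous analysis, the proposed method detected more brain-region-specific expression patterns.

7) Treating the expression values obtained from microarrays as a special kind of complex "trait", a new method is developed for mapping eQTLs and eEpistasis (epistasis controlling gene expression). With previously selected markers used as background control, a one-dimensional genome scan detects main-effect eQTLs; a further genome scan then tests for epistatic interaction between each main-effect eQTL and any other locus in the genome, and the P-value threshold is adjusted to control the false discovery rate of both steps. A dataset of recombinant inbred strains derived from the cross of C57BL/6J and DBA/2J was analyzed to validate the method.

8) Finally, two software packages, QTLNetwork and QTModel, were developed for data analysis. QTLNetwork maps and visualizes the polygenic system of complex traits across multiple environments. It currently supports F2, BC (first-generation backcross), RI, RIX (also called IF2) and BCnFn (repeated backcrossing and selfing) populations. QTModel has three modules: mixed, array and diallel. The mixed module is for conventional experimental designs with random factors, such as randomized block, factorial, multi-factor factorial, nested and cross-nested designs; the array module analyzes microarray data with one or two treatment factors to detect differentially expressed genes; and the diallel module is for classical diallel cross designs.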

【Abstract】 Understanding the genetic basis of complex traits is of key importance for the genetic improvement of crops and domestic animals, and helps to elucidate the genetic aetiology of complex human diseases. The essential tasks in the genetic analysis of complex traits are to map the quantitative trait loci (QTLs) that affect their inheritance, to detect interactions among QTLs (epistasis) and the differences in QTL and epistatic effects across environmental conditions, and consequently to identify the candidate genes underlying complex traits and their genetic regulatory networks. Based on mixed linear model approaches, a full-QTL model and a two-step mapping strategy were proposed for linkage analysis of segregating populations, dissecting the genetic architecture of a complex trait into the effects of individual QTLs, epistatic interactions between pairs of loci, and the interactions of QTLs (or epistasis) with environmental factors. Simulation studies and analyses of rice and mouse data were performed to validate the reliability and efficiency of the proposed method. Subsequently, a novel genetic model that integrates the genetic information of both host and parasite was proposed to map disease-related QTLs on the host and parasite genomes simultaneously, as well as to investigate the interactions among these QTLs. In addition, the aforementioned method was extended to QTL analysis based on high-throughput genotyping technology (e.g. DArT), and an approach was developed to predict superior genotypes using the genetic information obtained from QTL analysis. Furthermore, by combining gene expression and genotyping data from a segregating population, a new approach was proposed for mapping expression QTLs (eQTLs) and detecting epistatic interactions between a main-effect eQTL and any other locus. Finally, two software packages were developed to implement the aforementioned methodologies.
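As a reading aid, the decomposition into QTL main effects, epistasis and environment interactions described above can be written as a mixed linear model along the following lines. This is an illustrative sketch for a homozygous (RI or DH) population, and the notation is an assumption made here rather than taken from the thesis: $y_{hk}$ is the phenotype of line $k$ in environment $h$, and $x_{ik}$ codes the genotype of line $k$ at QTL $i$.

```latex
y_{hk} = \mu
       + \sum_{i} a_i \, x_{ik}
       + \sum_{i<j} aa_{ij} \, x_{ik} x_{jk}
       + e_h
       + \sum_{i} ae_{hi} \, x_{ik}
       + \sum_{i<j} aae_{hij} \, x_{ik} x_{jk}
       + \varepsilon_{hk}
```

In this sketch $a_i$ and $aa_{ij}$ are fixed additive and additive × additive effects, while $e_h$, $ae_{hi}$ and $aae_{hij}$ are random environment, QTL × environment and epistasis × environment effects, and $\varepsilon_{hk}$ is the residual. A model for F2 or RIX populations would add the corresponding dominance terms.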
The main features of the proposed methods and results are summarized as follows:

1) For homogeneous mapping panels, such as recombinant inbred (RI) and doubled-haploid (DH) populations, a full-QTL model was proposed to explore the genetic architecture of complex traits in multiple environments. The model includes the additive effects of multiple QTLs, additive × additive epistatic effects, and their interactions with environments. A mapping strategy comprising marker interval selection, detection of marker interval interactions, and genome scans was used to evaluate the putative locations of multiple QTLs and their interactions. An F-statistic based on Henderson method III was used for hypothesis testing. In each of the mapping procedures, permutation testing was used to control the genome-wide false positive rate, and model selection was used to reduce ghost peaks in the F-statistic profile. Parameters of the full-QTL model were estimated by a Bayesian method via Gibbs sampling. Monte Carlo simulations were conducted to illustrate the reliability and efficiency of the method. Two real datasets (BXD mouse olfactory bulb weight and rice yield) were used as worked examples to demonstrate the proposed methods.

2) For heterogeneous mapping panels, such as F2 and recombinant inbred intercross (RIX, also called IF2) populations, the aforementioned full-QTL model was extended to include the additive and dominance effects of QTLs, epistatic effects (additive × additive, additive × dominance, dominance × additive, and dominance × dominance), and their interactions with environments. A series of simulations was conducted to investigate the power and false discovery rate for detecting QTLs and epistasis under different RIX designs. Two real datasets, one from mouse and the other from rice, were analyzed to illustrate the validity of the proposed method. The results revealed that more than half of the QTLs showed pleiotropic effects, whereas epistatic interactions tended to differ between traits.
The proportion of phenotypic variation attributable to environmental effects differed considerably between traits.

3) The development of array-based high-throughput genotyping methods (e.g. diversity arrays technology, DArT, and single nucleotide polymorphism, SNP, arrays) has created significant opportunities to scale up mapping populations for genetic linkage analysis. A strategy was proposed for mapping QTLs based on the DArT genotyping system. A procedure was illustrated for constructing a high-density consensus linkage map of DArT and SSR markers from a sub-population of a DH population, and a second, low-density linkage map from SSR markers alone and the larger full DH population. Resistance to net type net blotch disease of barley was analyzed using the sub-population data with the high-density consensus map and the full-population data with the low-density SSR map, respectively. A pair of interacting QTLs was detected with both the sub-population and the full population. The results indicated that high-density molecular markers, a small population size and precise phenotyping could improve the precision of mapping major-effect QTLs and the efficiency of QTL mapping experiments.

4) In addition, methods were developed for predicting two kinds of superior genotypes (superior line and superior hybrid) based on QTL effects, including epistatic and QTL × environment interaction effects. Mathematical formulae were derived for predicting the total genetic effect, in a specific environment, of any individual derived from the mapping population with a known QTL genotype. Two algorithms, an enumeration algorithm and a stepwise tuning algorithm, were used to select the best multi-locus combination of all the putative QTLs. Grain weight per plant (GW) in rice was analyzed as a worked example to demonstrate the proposed methods.
The results showed that the predicted superior lines and superior hybrids were markedly superior to the F1 hybrid, indicating that considerable breeding potential remains for further improvement of GW. The results also indicated that epistatic effects and their interactions with environments contributed largely to the superiority of the predicted superior lines and superior hybrids.

5) Under the hypothesis that the host-parasite interaction system is governed by genome-for-genome interaction, we proposed a genetic model that integrates genetic information from both the host and parasite genomes. The model can be used to map QTLs that govern the interaction between host and parasite and to detect interactions among these QTLs. A one-dimensional (1D) genome scan was used to map QTLs in both the host and parasite genomes simultaneously, conditional on selected pairs of markers that control the background genetic variation; a two-dimensional (2D) genome scan was then conducted to search for epistasis within the host and parasite genomes and for interspecific QTL×QTL interactions between the two genomes. A permutation test was adopted to calculate the empirical threshold for controlling the experiment-wise false positive rate of detected QTLs and QTL×QTL interactions. Monte Carlo simulations were conducted to examine the reliability and efficiency of the proposed models and methods. The simulation results showed that our methods provide reasonable estimates of the parameters and adequate power for detecting QTLs and QTL×QTL interactions.

6) A statistical procedure was proposed to identify differentially expressed genes (DEGs) in gene expression data, with or without missing observations, from microarray experiments with one or two treatment factors. An F-statistic based on Henderson method III was constructed to test the significance of differential expression for each gene across treatment levels.
The cutoff P-value was adjusted to control the experiment-wise false discovery rate. A human acute leukemia dataset collected from 38 leukemia patients was re-analyzed by the present method. Comparison with the results from SAM (significance analysis of microarrays) and MAANOVA (microarray analysis of variance) indicated that the present method performs similarly to MAANOVA for data with one treatment factor, but MAANOVA cannot directly handle missing data. A mouse brain dataset collected from six brain regions of two inbred strains (two treatment factors) was re-analyzed to identify genes with distinct region-specific expression patterns. The results showed that the proposed method could identify more distinct region-specific expression patterns than the previous analysis of the same dataset.

7) Considering gene expression values as a special kind of complex "trait", a novel method was proposed to identify genetic polymorphisms (eQTLs), and interactions among them, that affect gene expression. The method starts with a 1D genome scan to search for eQTLs with individual effects, conditional on previously selected candidate markers that control the background genetic variation. Each main-effect eQTL is then tested for interaction with any other locus, with or without an individual effect of its own. In both the detection of main-effect eQTLs and the detection of genetic interactions, the cutoff P-value was adjusted to control the experiment-wise false discovery rate. A mouse dataset collected from a group of RI strains derived from two parental strains, C57BL/6J and DBA/2J, was analyzed to illustrate the utility of the proposed method.

8) Two software packages, QTLNetwork and QTModel, were developed in C++ to implement the aforementioned methodologies.
QTLNetwork was developed for mapping and visualizing the genetic architecture underlying complex traits for experimental populations in multiple environments. It can handle data from F2, backcross, recombinant inbred line, doubled-haploid, RIX (also called IF2) and BCnFn populations. QTModel has three modules: mixed, array and diallel. The mixed module was developed for analyzing data from experimental designs with random factors, such as randomized block, factorial, multi-factor factorial, nested, and cross-nested designs. The array module analyzes microarray expression data with one or two treatment factors to detect differentially expressed genes. The diallel module was developed for analyzing data from classical diallel cross designs.
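The permutation thresholds mentioned in items 1) and 5) can be illustrated with a minimal Python sketch. It is not the QTLNetwork implementation: a plain one-way ANOVA F-statistic stands in for the Henderson method III F-statistic, the markers are simulated as unlinked, and all function names and parameter values (`simulate`, `scan`, `permutation_threshold`, 120 lines, 30 markers) are hypothetical choices made for illustration.

```python
import numpy as np


def f_statistic(y, g):
    """One-way ANOVA F for phenotype y grouped by marker genotype g.

    Assumption: this simple F replaces the Henderson method III
    F-statistic described in the abstract.
    """
    groups = [y[g == level] for level in np.unique(g)]
    k, n = len(groups), len(y)
    grand_mean = y.mean()
    ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups)
    ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def scan(y, genotypes):
    """Single-marker genome scan: one F-statistic per marker column."""
    return np.array([f_statistic(y, genotypes[:, m])
                     for m in range(genotypes.shape[1])])


def permutation_threshold(y, genotypes, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide threshold from the permutation distribution of max F.

    Shuffling y breaks any phenotype-genotype association while keeping
    the marker data intact, so the (1 - alpha) quantile of the maximum F
    over all markers controls the genome-wide false positive rate.
    """
    rng = np.random.default_rng(seed)
    max_f = np.empty(n_perm)
    for p in range(n_perm):
        max_f[p] = scan(rng.permutation(y), genotypes).max()
    return np.quantile(max_f, 1 - alpha)


def simulate(n_lines=120, n_markers=30, qtl=7, effect=1.5, seed=1):
    """Hypothetical DH-like data: unlinked 0/1 markers, one additive QTL."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 2, size=(n_lines, n_markers))
    y = effect * genotypes[:, qtl] + rng.normal(size=n_lines)
    return y, genotypes


if __name__ == "__main__":
    y, genotypes = simulate()
    profile = scan(y, genotypes)
    threshold = permutation_threshold(y, genotypes)
    print("markers above threshold:", np.flatnonzero(profile > threshold))
```

Taking the maximum F over the whole scan in each permutation is what makes the threshold genome-wide rather than per-marker; a real analysis would also have to account for linkage between markers and for the background-control markers described in the abstract.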

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication year and issue】 2008, Issue 09