
基于表型以及微阵列数据的基因(型)分类技术研究

Phenotype and Microarray Data-based Clustering Analysis of Genotypes or Genes

【Author】 Xiao Jing (肖静)

【Advisor】 Xu Chenwu (徐辰武)

【Author information】 Yangzhou University (扬州大学), Crop Genetics and Breeding, 2007, PhD

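【Abstract (Chinese version)】 Segregation analysis (SA) is a statistical genetic method that detects whether major genes exist, and estimates their effects, directly from the phenotypes of quantitative traits in a segregating population. It is the basis for subsequent QTL mapping and genomic analysis. Under the genetic assumption that major genes and minor genes act independently, individuals of the same major-gene genotype follow a continuous normal distribution, and different major-gene genotypes form a mixture of normal distributions with different means and a common variance. SA therefore estimates major-gene effects and tests genetic hypotheses by building a Gaussian mixture model, estimating its parameters by maximum likelihood, and computing likelihood ratio test statistics. Existing SA methods, however, all work on a single trait, and their statistical power to detect major genes is low. This study therefore proposes a joint method for multiple traits, multivariate segregation analysis (MSA). Because MSA makes full use of the genetic and residual correlations among several quantitative traits, it is expected to raise the power of major-gene detection and to help dissect the genetic architecture of complex traits. MSA builds a mixture of multivariate Gaussian distributions, estimates the segregation ratios, major-gene effects and residual variation by maximum likelihood via the EM algorithm, tests genetic hypotheses about the major gene with likelihood ratio test statistics, and compares the Bayesian information criterion (BIC) of three candidate models (pleiotropy, independent inheritance and close linkage) to decide whether the major gene is pleiotropic or closely linked. To validate the method, two simulation experiments were set up using an F2 population as the example. Experiment 1 examined the statistical power of MSA and the accuracy and precision of its estimates of major-gene effects and residual variation under different major-gene heritabilities and sample sizes. Experiment 2 examined the ability of MSA to distinguish a pleiotropic major gene from closely linked major genes under different heritabilities. The simulations showed that: (1) whether the major gene controls several traits simultaneously or only one of them, MSA significantly improves its detection, because the joint analysis exploits the correlation among traits; (2) MSA significantly improves the accuracy and precision of the estimated major-gene effects, and in general, once the detection power exceeds 50%, the accuracy and precision of the corresponding estimates reach a satisfactory level; (3) MSA can effectively distinguish whether several traits are controlled by one major gene or by several closely linked major genes; (4) of the two key factors affecting detection power, heritability has a clearly larger effect than sample size. The analysis procedure was demonstrated on the plant height and tiller number of 597 F2 plants from the rice cross Duonieai × Zhonghua 11. The results indicated that plant height and tiller number in this cross are controlled by the same major gene. Its additive and dominance effects on plant height were -21.3 cm and 40.6 cm, showing overdominance; its additive and dominance effects on tiller number were 22.7 and -25.3, showing close to complete dominance.
MSA not only estimates the genetic parameters of the model but also gives the posterior probability that each individual belongs to each major-gene genotype. This study therefore proposes a new way to classify individuals by their Bayesian posterior probabilities, namely a model-based unsupervised dynamic clustering method. As in MSA, the cluster parameters are estimated by maximum likelihood via the EM algorithm, and each individual is assigned according to the Bayesian posterior probability of its cluster membership. The simulations showed that: (1) the method generally gives unbiased estimates of the cluster parameters and can determine the optimal number of clusters from the BIC values of the candidate models, which resolves the difficulty of choosing the number of clusters in traditional dynamic clustering; (2) it is more robust than centroid-based dynamic clustering (k-means) and minimum square sum within groups (MinSSw) dynamic clustering; (3) raising the assignment criterion effectively lowers the misclassified rate (MR). Fisher's Iris data were used to verify the feasibility of the method. The analysis showed that unsupervised dynamic clustering that maximizes the likelihood function is especially suitable for data that follow a Gaussian distribution, with an MR significantly lower than that of k-means and MinSSw.
DNA microarray technology is one of the main tools of functional genomics in the post-genomic era; it measures the expression levels of thousands of genes at once across different experimental conditions or tissues. Gene clustering, which groups genes with similar expression patterns into one cluster, is a useful tool for extracting the latent biological information in gene expression profiles and is the most widely used class of methods in microarray data analysis. Depending on whether prior information is available, clustering is either unsupervised or supervised. To explore whether the model-based clustering method above can be applied to high-dimensional microarray expression profiles, computer-simulated data, yeast cell-cycle microarray data and human cancer cell line NCI-60 microarray data were clustered, and the results were compared with those of k-nearest neighbour (KNN), binary support vector machines (SVMs) and multicategory SVMs (MC-SVMs). False positives (FP), false negatives (FN), clustering accuracy and Matthews' correlation coefficient (MCC) were used to compare the strengths of the supervised clustering methods and the settings each is suited to. The results showed that: (1) for expression profiles of thousands of genes, model-based clustering was the most accurate, and when the training sample was small, model-based supervised clustering, which uses information from both known and unknown genes to guide the assignment of unknown genes, was more accurate than model-based discriminant classification, which uses only the information from known genes, although it ran more slowly; (2) MC-SVMs were comparatively robust, the most widely applicable, and insensitive to high dimensionality. They were suitable for clustering expression profiles of thousands of genes, where their accuracy was second only to model-based supervised clustering, and also for clustering a few dozen samples with thousands of genes as features, where their accuracy was the highest; (3) among the MC-SVMs, OVO (one-versus-one) and DAGSVM (directed acyclic graph SVM) are preferable when the sample is large; OVR (one-versus-rest), WW (the method of Weston and Watkins) and CS (the method of Crammer and Singer) gave higher clustering accuracy and MCC when the sample is small; and the five MC-SVMs performed alike at moderate sample sizes; (4) it is recommended to try at least two methods, chosen according to the characteristics of the data and the needs of the experiment, in order to obtain the best clustering result.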
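The model-based clustering summarized in the abstract (a Gaussian mixture fitted by EM, BIC to choose the number of components, and assignment by posterior probability) can be illustrated with a minimal single-trait sketch. This is not the thesis's MSA implementation: it assumes one trait, a common variance across components, freely estimated mixing proportions, and the convention BIC = -2 ln L + p ln n with smaller values preferred (the thesis may use a different sign convention). The names `fit_mixture`, `select_k` and `classify` are hypothetical.

```python
import numpy as np

def _log_joint(y, pi, mu, var):
    # log(pi_k) + log N(y_i | mu_k, var) for every individual i and component k
    return (np.log(pi) - 0.5 * np.log(2.0 * np.pi * var)
            - 0.5 * (y[:, None] - mu[None, :]) ** 2 / var)

def fit_mixture(y, k, n_iter=300):
    """EM for a 1-D Gaussian mixture with k means and one common variance."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Assumed initialisation: means at evenly spaced quantiles, equal weights.
    mu = np.quantile(y, (np.arange(k) + 0.5) / k)
    pi = np.full(k, 1.0 / k)
    var = y.var()
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each individual.
        log_p = _log_joint(y, pi, mu, var)
        post = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: update proportions, means and the common variance.
        nk = np.maximum(post.sum(axis=0), 1e-12)
        pi = nk / n
        mu = (post * y[:, None]).sum(axis=0) / nk
        var = max((post * (y[:, None] - mu[None, :]) ** 2).sum() / n, 1e-12)
    # Log-likelihood and posteriors at the final parameter values.
    log_p = _log_joint(y, pi, mu, var)
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    loglik = float(log_norm.sum())
    post = np.exp(log_p - log_norm)
    n_params = (k - 1) + k + 1  # proportions + means + common variance
    bic = -2.0 * loglik + n_params * np.log(n)
    return {"pi": pi, "mu": mu, "var": var, "post": post,
            "loglik": loglik, "bic": bic}

def select_k(y, k_max=5):
    """Fit k = 1..k_max and return the k with the smallest BIC and its fit."""
    fits = {k: fit_mixture(y, k) for k in range(1, k_max + 1)}
    best_k = min(fits, key=lambda k: fits[k]["bic"])
    return best_k, fits[best_k]

def classify(post, threshold=0.0):
    """Assign each individual to its most probable component.

    Individuals whose largest posterior is below `threshold` are left
    unassigned (label -1), mimicking a stricter assignment criterion.
    """
    labels = post.argmax(axis=1)
    labels[post.max(axis=1) < threshold] = -1
    return labels
```

For example, calling `select_k(y)` on a vector of trait values returns the chosen number of components together with the fitted parameters, and `classify(fit["post"], threshold=0.95)` leaves ambiguous individuals unassigned instead of forcing a label.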

【Abstract】 Segregation analysis (SA) is a statistical genetic method that uses the phenotypes of quantitative traits in a segregating population directly to detect major genes and estimate their effects. It is an important tool for planning further studies such as quantitative trait locus mapping or more sophisticated genomic analyses. Under the assumption that major-gene effects and polygenic effects are independent, individuals with the same major-gene genotype are expected to be normally distributed, whereas individuals with different major-gene genotypes follow a mixture of normal distributions with different means and the same variance. SA therefore estimates major-gene effects and tests genetic hypotheses by constructing a Gaussian mixture model, estimating its parameters by maximum likelihood (ML) and calculating likelihood ratio test (LRT) statistics. However, current SA methods analyse a single trait and typically have low statistical power. In this study we propose a joint analysis method for multiple traits, multivariate segregation analysis (MSA), which uses the genetic and residual correlations among multiple quantitative traits to detect major genes. The method is expected not only to increase statistical power but also to allow dissection of the genetic architecture underlying a trait complex. In MSA the observed phenotypes of multiple correlated traits are fitted to a multivariate Gaussian mixture model. The segregation proportions, major-gene effects and residual variation are estimated under the ML framework via the expectation-maximization (EM) algorithm. Genetic hypotheses about the major genes are tested with LRT statistics. Pleiotropy is distinguished from close linkage by comparing three candidate models (the complete pleiotropic model, the linkage model and the non-linkage, or independent, model) using the Bayesian information criterion (BIC). Two simulation experiments based on the F2 mating design were performed to validate the method. The first investigated the statistical power of MSA and the accuracy and precision of its estimates of genetic effects and residual variation under varying heritabilities and sample sizes. The second examined how well MSA separates pleiotropy from close linkage under varying heritabilities. The simulations showed that: (1) because MSA exploits the correlation among traits, it increases the statistical power of major-gene detection whether the major gene controls the expression of several traits simultaneously or only one of them; (2) MSA improves the precision and accuracy of the major-gene effect estimates, and in general, once the statistical power of major-gene detection exceeds 50%, the precision and accuracy reach a satisfactory level; (3) MSA can effectively separate pleiotropy from close linkage; (4) although heritability and sample size are both key factors affecting the power to detect major genes, increasing heritability improves power much more than increasing sample size. The procedure was illustrated with the plant height and tiller number of an F2 population from the rice cross Duonieai × Zhonghua 11. The results indicated that the genetic difference in these two traits in this cross involves a single pleiotropic major gene. The additive and dominance effects of the major gene were estimated as -21.3 cm and 40.6 cm on plant height, and 22.7 and -25.3 on tiller number, respectively. The major gene shows overdominance for plant height and close to complete dominance for tiller number.
MSA not only estimates the genetic parameters of the model but also calculates the posterior probability that each individual belongs to each major-gene genotype. We therefore introduce a new method, model-based unsupervised dynamic clustering, which classifies individuals according to their Bayesian posterior probabilities. In this method the cluster parameters are likewise estimated by ML via the EM algorithm, and the individuals are classified by their Bayesian posterior probabilities. The simulation experiments showed that: (1) the proposed method estimates the cluster parameters without bias and determines the optimal number of clusters by BIC, which resolves the difficulty of choosing the number of clusters in traditional dynamic clustering methods; (2) the proposed method is more robust than the k-means method and the minimum square sum within groups (MinSSw) method; (3) the misclassified rate (MR) can be reduced by using a stricter discrimination criterion. The proposed method was further validated on Fisher's Iris dataset. The result indicated that unsupervised dynamic clustering based on maximizing the likelihood function is especially suitable for data generated from a Gaussian distribution, since its MR was significantly lower than that of the k-means and MinSSw methods.
DNA microarray technology is a chief tool for functional genome research in the post-genomic era; it allows the expression levels of thousands of genes to be monitored simultaneously under varying experimental conditions or in different tissues. Grouping genes with similar expression patterns, called gene clustering, has proved to be a useful tool for extracting the biological information underlying gene expression data, and it is the most widely used class of methods in microarray data analysis. Depending on whether prior knowledge is used, clustering methods are classified as unsupervised or supervised. To explore the feasibility of applying the model-based clustering method above to high-dimensional microarray expression data, several typical supervised clustering methods, namely Gaussian mixture model-based supervised clustering, k-nearest-neighbor (KNN), binary support vector machines (SVMs) and multicategory support vector machines (MC-SVMs), were used to classify computer-simulated data, yeast cell-cycle microarray data and microarray data from 60 human cancer cell lines (NCI-60). False positives, false negatives, true positives, true negatives, clustering accuracy and Matthews' correlation coefficient (MCC) were compared among these supervised methods. The results are as follows. (1) When thousands of gene expression profiles are classified, the model-based clustering methods achieve the highest clustering accuracy. Furthermore, when the training sample is very small, the model-based supervised method, which uses the prior knowledge of both known and unknown functional genes to guide the classification of the unknown functional genes, is more accurate than the model-based discrimination method, which uses only the information of the known functional genes. In computational speed, however, the discrimination method is faster than the model-based supervised method. (2) In general, the MC-SVMs are more robust and more widely applicable, and they are less sensitive to the curse of dimensionality. They are second only to the model-based method in clustering accuracy on thousands of gene expression profiles, and more robust than the other techniques on a small number of high-dimensional gene expression samples. (3) Among the MC-SVMs, OVO and DAGSVM perform better on large samples, OVR, WW and CS yield better results on small samples, and the five MC-SVM methods perform very similarly on moderate samples. (4) We recommend that at least two candidate methods, chosen according to the features of the real data and the experimental conditions, be run and compared in order to obtain a better clustering result.
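The evaluation measures named in the abstract (false positives, false negatives, true positives, true negatives, accuracy and MCC) follow from a two-class confusion table. The sketch below uses the standard MCC formula, (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the function names are hypothetical, and returning 0 when a marginal total is zero is a common convention assumed here, not something stated in the abstract.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews' correlation coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC stays informative when one class is much larger than the other, which is the usual situation when a few genes of a known functional class are sought among thousands.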

  • 【Online publication contributor】 Yangzhou University (扬州大学)
  • 【Online publication year and issue】 2007, Issue 06