Outlier Detection Methods for Genetic Data Analysis Based on Mixed Linear Models

Detection of Influential Observations in Mixed Linear Models Using Two Types of Estimation and Prediction Methods for Genetic Data Analysis

【Author】 尤萨夫

【Advisor】 朱军

【Author information】 Zhejiang University, Statistical Genetics and Bioinformatics, 2008, PhD

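【Chinese abstract】 Genetic data analysis with mixed linear models is a challenge for statisticians and geneticists alike, because linear, quadratic and likelihood-based estimation methods are all strongly affected by aberrant values in the independent or dependent variables. The only way to understand how outliers affect the results of an analysis is through repeated data quality checks and model refinement. With this in mind, this study proposes an outlier detection method for genetic data analysis with mixed linear models, based on MINQUE (minimum norm quadratic unbiased estimation) and AUP (adjusted unbiased prediction) and denoted Method I. It compares this method with one based on the EM algorithm and BLUP (best linear unbiased prediction), denoted Method II, and then validates the methods on two real data sets.

The study first demonstrates the method on a commonly used genetic model (including variety, year and location) and introduces a set of statistics to assess how strongly outliers affect the results. Cook's distance (CD(β)), the Andrews-Pregibon statistic (AP), the Cook-Weisberg statistic (CW) and the variance ratio (VR) assess the influence of a data point on the fixed effects of the mixed linear model, while Cook's distance (CD(e)) assesses the influence of a data point on the random effects. A simulation program was written in C++: simulated data were generated by Monte Carlo simulation, a number of outliers were set at random, and the proposed method was applied to detect them in order to test its effectiveness and reliability.

The results show that, using these diagnostics, both Method I and Method II detect the outliers deliberately planted in the simulated data, and the two have similar detection ability. Both methods were also applied to data containing no outliers in order to compare their false positive rates. The diagnostics obtained with Method I were more stable than those obtained with Method II, so Method I is more robust for outlier detection. Outliers were also planted in the simulated data for specific combinations of variety, year and location. In most cases both methods detected such outliers; in some examples Method I showed stronger detection ability, while in others Method II performed better.

The main results can be summarized as follows:
1) The proposed method detects aberrant phenotypic values in mixed linear models well. If the data contain only a few scattered outlying observations, they can be detected with either Method I or Method II. However, if a variety has several outliers at the same location in the same year, these outliers cannot be detected, and correct observations are instead judged to be outliers.
2) Based on these methods, a set of computer programs was written in C++ for genetic data analysis with mixed linear models. It detects aberrant observations and ranks them by the P-value of the statistical test. It also provides estimates of the variance components and predictions of the random effects in the model.
3) In the analysis of the common genetic model, some outliers go undetected because they are masked by other outliers, while some normal observations are mistaken for outliers under the influence of several other outliers.
4) In the worked example with the common genetic model, the presence of outliers can seriously affect the estimation of fixed effects and the prediction of random effects, and removing them can substantially improve the parameter estimates. For the QTL mapping data, removing outliers allows additional QTLs to be detected and improves the estimate of heritability. Both real-data analyses show that removing outliers improves the parameter estimates. Of course, we cannot arbitrarily conclude that the removed outliers have no biological meaning.
5) The proposed method can be extended to more complex genetic models, such as the additive-dominance model and the additive-dominance-maternal effect model, to analyze the influence of outliers on genetic and non-genetic effects. It can also be applied to microarray data analysis to detect aberrant data caused by machine calibration, data entry and coding during data collection.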

【Abstract】 Genetic data analysis with mixed linear models is one of the most challenging problems for statisticians and geneticists alike, because it has traditionally relied on linear, quadratic and likelihood estimation methods, which are not robust to aberrant cases in the response or in the factor space. Careful inspection, through data quality checks and model specification, is the only way to understand the effect of unusual data points on the results of an analysis. With this in mind, the present study proposes a technique in the framework of adjusted unbiased prediction (AUP) via the minimum norm quadratic unbiased estimation (MINQUE) method (Method-I) for detecting unusual data points in mixed linear models for genetic data analysis. The proposed method was compared with best linear unbiased prediction (BLUP) via the expectation-maximization (EM) algorithm (Method-II) to check its validity. In addition, two real data sets were analyzed to address the consequences of influential observations and outliers in biological research. A general genetic model was considered to illustrate the proposed method and to compare it with the existing methods using various influence diagnostic statistics. Four influence diagnostic statistics, i.e. the analogue of Cook's distance (CD(β)), the Andrews-Pregibon statistic (AP), the Cook-Weisberg statistic (CW) and the variance ratio (VR), were applied to detect data points influencing the fixed effects of a mixed linear model, while the analogue of Cook's distance (CD(e)) was used to inspect data points affecting the random components of the model. To check the efficacy and reliability of the proposed method, Monte Carlo simulations were conducted with various settings of aberrant observations in the phenotype data of a general genetic model. All simulations were performed by a program written in C++.
It was not rigorously proved that either method performs better than the other. Both methods showed almost the same detection ability and the same trends regarding the presence of aberrant observations in the response, using the aforementioned diagnostic statistics for the influence of the i-th data point on the fixed and random components of a mixed linear model. Both methods were also compared for their false positive rate on a clean data set. For both the fixed and random components of the general genetic model (a mixed linear model), the values of each influence diagnostic statistic were more tightly clustered under Method-I than under Method-II. This indicates the robustness of the proposed method (Method-I) in the presence of unusual observations and gave us confidence that it will perform better in identifying aberrant observations. In the simulations, for different perturbations of the phenotype data with regard to various genotypes, locations and years, our approach showed the same trend as Method-II, and agreed closely with it, under a variety of influence diagnostic statistics. However, in some situations Method-I gave larger magnitudes for some of the influence statistics, and in others Method-II did.

The main results from the simulations and the real data sets are summarized as follows:
1. Our approach is verified to perform well in identifying an aberrant observation in the response vector of a mixed linear model, if one exists. If there is only one aberrant observation in the phenotype data, for any genotype at any location or in any year, it can be detected successfully using any of the influence diagnostic statistics under both methods. If there are multiple influential observations in the phenotype data of a general genetic model, some of them can be detected effectively by both methods, while for others the influence diagnostic statistics show some noise.
2. A program written in C++ was developed to identify influential observations and outliers in the data analysis of a general genetic experiment within the framework of the mixed linear model. The program also provides estimates of the variance components and predictions of the random effects involved in the model. In addition, it reports the significance (P-value) of each individual observation in a data set.
3. The results for the general genetic model, analyzed in the framework of the mixed linear model, showed both masking and swamping effects in the presence of multiple unusual data points in the phenotype values.
4. In the worked example (a general genetic experiment), the presence of influential observations and outliers badly distorted the estimates of variance components and the predictions of random effects (breeding values). Removing these data points can drastically change the parameter estimates of a mixed linear model and provide useful results. In the QTL mapping data, the results demonstrate that a clean data set makes it possible to identify additional QTLs with individual effects, and that improved estimates of phenotypic variation (heritability), and particularly of the residuals, can be obtained in the absence of influential observations and outliers. In general, in both data sets analyzed, removing influential observations and outliers brought substantial change in the estimates of various parameters of a mixed linear model. However, it is not claimed that outliers and influential observations cannot be biologically valid data points.
5. The method can easily be extended to more complex genetic models, e.g. additive-dominance and additive-dominance-maternal models, for studying the effect of unusual data points on the various genetic and non-genetic effects involved in the mixed linear model. It can also be used in microarray data analysis based on the mixed linear model approach to identify hidden peculiarities caused by machine, data entry or recording errors, or possibly by differentially expressed (or non-expressed) genes.
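
As a rough illustration of the kind of case-influence diagnostic named in the abstract, the following Python sketch computes the classical Cook's distance for each observation under a purely fixed-effects cell-means model, with one mean per genotype-by-environment cell fitted by ordinary least squares. This is a deliberate simplification and not the thesis's method: it contains no random effects and uses neither MINQUE/AUP nor EM/BLUP. The function name `cooks_distance_cell_means`, the cell labels and the phenotype values are all hypothetical and were invented for this example.

```python
from statistics import mean


def cooks_distance_cell_means(cells):
    """Classical Cook's distance under a fixed-effects cell-means model.

    `cells` maps a (genotype, environment) label to its list of replicate
    phenotype values. Hypothetical helper for illustration only; it is an
    ordinary least squares diagnostic, not the MINQUE/AUP or EM/BLUP analogue.
    """
    if any(len(values) < 2 for values in cells.values()):
        raise ValueError("every cell needs at least two replicates")

    n_obs = sum(len(values) for values in cells.values())
    n_params = len(cells)  # one fitted mean per cell

    # Residual of each observation from its own cell mean.
    residuals = {
        (label, k): y - mean(values)
        for label, values in cells.items()
        for k, y in enumerate(values)
    }

    # Residual mean square on n_obs - n_params degrees of freedom.
    s2 = sum(r * r for r in residuals.values()) / (n_obs - n_params)
    if s2 == 0:
        raise ValueError("residual variance is zero; distances are undefined")

    distances = {}
    for (label, k), r in residuals.items():
        h = 1.0 / len(cells[label])  # leverage in a cell-means design
        distances[(label, k)] = (r * r / (n_params * s2)) * h / (1.0 - h) ** 2
    return distances


# Invented toy data: 3 genotypes x 2 environments, 4 replicates per cell.
phenotypes = {
    ("G1", "E1"): [10.1, 9.9, 10.0, 10.2],
    ("G1", "E2"): [12.0, 11.8, 12.1, 11.9],
    ("G2", "E1"): [8.0, 8.2, 7.9, 8.1],
    ("G2", "E2"): [9.0, 9.1, 8.9, 15.0],  # last value is a planted outlier
    ("G3", "E1"): [11.0, 11.1, 10.9, 11.2],
    ("G3", "E2"): [13.0, 12.9, 13.1, 13.0],
}

distances = cooks_distance_cell_means(phenotypes)
most_influential = max(distances, key=distances.get)
print(most_influential, round(distances[most_influential], 3))
```

In this simplified setting a single planted outlier dominates the ranking. If several aberrant values share one cell, they pull the cell mean toward themselves and shrink their own residuals, which is one way the masking effect described in the results can arise.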

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication issue】 2008, No. 09
  • 【Classification code】 Q3
  • 【Times cited】 1
  • 【Downloads】 426