分散度量模型中的变量选择

Study on the Variable Selection Problems in Dispersion Modeling

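【Author】 Wang Darong (王大荣)

【Advisor】 Zhang Zhongzhan (张忠占)

【Author information】 Beijing University of Technology, Probability Theory and Mathematical Statistics, 2009, PhD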

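【Abstract (translated from the Chinese)】 An important problem in model building is how to pick the important variables out of a large number of explanatory variables, that is, the variable selection problem. A large body of literature has studied variable selection in linear models and generalized linear models from various angles. As science and technology advance, people face increasingly complex data and model structures. Hierarchical regression models are an important class among them, and they can better explain the causes and patterns of variation in the data. However, the current literature concentrates mostly on variable selection for mean regression models; once the dispersion parameter is itself given a model structure, variable selection under joint modeling of the mean and dispersion parameters has rarely been studied. Our research finds that applying methods designed for mean models directly to the joint modeling structure may cause problems or lead to wrong inferences, so variable selection needs to be studied for such complex model structures. This dissertation studies variable selection under joint modeling of the mean and dispersion parameters, together with applications of the ideas of variable selection, and obtains three main results. For heteroscedastic regression models, we study simultaneous variable selection under joint modeling of the mean and the variance. When the number of parameters in the mean model is large relative to the sample size, the maximum likelihood estimates of the parameters in the variance model are usually biased, and using such estimates for variable selection increases the model risk. From the viewpoint of bias correction, we adopt the adjusted profile likelihood function as the loss function and, on an information-theoretic basis, propose a new variable selection criterion, PICa. Unlike classical methods, this criterion takes into account the information in both the mean model and the variance model and imposes an appropriate penalty on the variables in each model, so that variables are selected simultaneously. We prove that, under certain regularity conditions, the criterion has the following desirable asymptotic properties: for the mean model, PICa is consistent for model selection; for the variance model, the probability that the model selected by PICa underfits tends to zero as the sample size grows. Monte Carlo simulation studies show that the new criterion outperforms traditional methods in many common situations. For double generalized linear models, on the one hand, we address classical variable selection by using the extended quasi-likelihood function to generalize the classical AIC criterion, and we verify the effectiveness of the criterion through simulations and a real-data analysis. On the other hand, we also study variable selection for high-dimensional data. When the number of variables is large and the amount of data is not large enough, traditional subset selection can hardly distinguish among the many candidate models and is difficult to carry out because of its computational cost. For double generalized linear models, the parameters of the dispersion model must be estimated in addition to those of the mean model, which makes the computation even heavier. We propose a class of non-concave penalized extended quasi-likelihood methods, prove that the resulting estimators have the Oracle property, and propose a fast new algorithm. Since the good properties of the estimators depend on the choice of the tuning parameters in the penalty function, we also improve the method for choosing these tuning parameters from the viewpoint of model selection consistency. As a main component of modeling, the idea of "variable selection" essentially reflects how well a model fits the data, so it can also be used for other modeling problems. For the problem of the mutual influence between outlying data and variable transformation in regression analysis, we take the variable selection viewpoint and combine the generalized information criterion for model selection with the constructed-variable method to propose a class of methods for the simultaneous diagnosis of data transformations and outliers. The method simultaneously considers the four candidate models defined by whether outliers are present and whether a transformation is needed; in some cases it both reduces the strong influence of outliers on the data transformation and avoids the masking of outliers by the transformed data. The effectiveness of the method is verified through simulations and real examples, and it is compared with methods in the literature.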

【Abstract】 Variable selection is fundamental to statistical modeling, and a large body of research has been devoted to it. With the development of modern technology, increasingly complicated data and models have emerged. Hierarchical regression models, which can explain data better, are an important class among them. However, most of the literature is concerned with variable selection for the mean regression model, and few methods have been proposed for joint modeling of the mean and dispersion. Our research finds that variable selection methods that are adequate for mean models may fail when extended directly to hierarchical regression models. It is therefore necessary to study variable selection for such complicated models. This dissertation studies variable selection under joint mean and dispersion modeling. Furthermore, the idea of variable selection is applied to data diagnostics. Our results fall into three parts. For heteroscedastic regression models, we discuss simultaneous variable selection for the mean model and the variance model. When the number of mean parameters is a large fraction of the sample size, the MLEs of the variance parameters can be seriously biased, and variable selection based on such estimators increases the model risk. We propose a criterion named PICa based on the adjusted profile log-likelihood function, which has been used to reduce the bias of variance component estimators. Our method differs from conventional ones in that it combines the information in the mean model with the information in the variance model, and PICa puts suitable penalty weights on the mean and variance variables. It can thus select variables for the mean and variance models simultaneously. Under regularity conditions, we prove that PICa has the following asymptotic properties: for the mean model, PICa is consistent for model selection; for the variance model, the probability of underfitting tends to zero as the sample size grows. Monte Carlo simulations show that PICa performs better than conventional methods in many common situations. For double generalized linear models, on the one hand, we propose a variable selection criterion based on the extended quasi-likelihood. The new criterion is an extension of Akaike's information criterion, and its performance is investigated through simulation studies and a real-data application. On the other hand, we study variable selection for high-dimensional generalized linear models with dispersion modeling. When there are many variables and not enough data, subset selection methods may fail to distinguish among the large number of candidate models, and they are hard to put into practice because of the heavy computation. We propose a class of non-concave penalized extended quasi-likelihood methods, prove the Oracle property of the resulting estimates, and put forward a new algorithm for the procedure. Since the properties of the estimates depend on the tuning parameters in the penalty function, we also improve the choice of these tuning parameters from the viewpoint of model selection consistency. As part of a modeling strategy, variable selection is an important tool that reflects how well a model fits the data, so it can also be applied to other areas of statistical modeling. We focus on the masking effects between the diagnosis of outliers and the diagnosis of response transformations in regression analysis. Based on the idea of variable selection, a simultaneous diagnosis method is proposed by constructing covariates and employing the generalized information criterion. The efficiency of the proposed approach is compared with naive methods through a Monte Carlo simulation and two examples.
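
As an illustration of what a non-concave penalty looks like, the sketch below implements the SCAD penalty of Fan and Li, a widely used member of this family. The abstract does not say which penalty the dissertation uses, so the choice of SCAD, the function names `scad_penalty` and `scad_derivative`, and the default `a = 3.7` are assumptions made only for this example; it is not the dissertation's algorithm.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty p_lam(|theta|) for tuning parameter lam > 0 and shape a > 2.

    Illustrative only: the abstract names a "non-concave penalty" without
    specifying which one, so SCAD is an assumption here.
    """
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear (L1-type) penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Middle range: quadratic piece that gradually relaxes the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def scad_derivative(theta: float, lam: float, a: float = 3.7) -> float:
    """Derivative of the SCAD penalty with respect to |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam
    # Decreases linearly from lam to 0 on (lam, a*lam], then stays at 0.
    return max(a * lam - t, 0.0) / (a - 1)
```

The derivative is zero for coefficients larger than `a * lam`, which is the feature that lets penalties of this kind avoid over-shrinking large effects; how strongly small effects are penalized is governed by the tuning parameter `lam`, which is why its choice matters for selection consistency.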
