
非参和半参回归模型的稳健和截面推断

Robust and Profile Inferences for Some Nonparametric and Semiparametric Regression Models

【Author】 Li Feng (李锋)

【Advisor】 Lin Lu (林路)

【Author information】 Shandong University, Probability Theory and Mathematical Statistics, 2010, PhD

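【Abstract (translated from the Chinese)】 In identifying the regression structure between a response variable and predictor variables, nonparametric and semiparametric regression models have been studied in depth and applied widely because of their flexibility and/or interpretability. Among semiparametric models, the partially linear model is a commonly used class: it keeps the flexibility of a nonparametric model while retaining the interpretability of a parametric one, and in particular it avoids the "curse of dimensionality" of fully nonparametric regression. In recent years, covariate-adjusted models and variable selection have become active topics in the analysis of real medical data and have attracted considerable attention. However, in nonparametric regression the ordinary kernel estimator is sensitive to the choice of bandwidth and its convergence rate is unsatisfactory; covariate-adjusted partially linear models have not yet been studied; and, as Fan and Li (2004) pointed out, variable selection for partially linear models has received little study. This thesis addresses these problems in nonparametric and semiparametric regression. Its basic ideas are as follows.

Existing results show that the ordinary kernel estimator of a nonparametric regression function admits an approximate representation [equation missing in source]. This representation reveals a new regression relationship: $r(x)$ can be regarded as the intercept in the regression of $\hat r_{h_j}(x)$ on $h_j$, so we can build a linear regression model and estimate $r(x)$ by weighted least squares. The new estimator has a simple structure and, although it uses no higher-order kernel, still has a small mean squared error. The result is [equation missing in source], and the optimal bandwidth is of order $O(n^{-1/9})$. Moreover, even when the bandwidths $h_j$ are not optimal, as long as $h_j = O(n^{-\alpha})$ with $1/10 < \alpha < 1/5$, the new estimator $\hat r(x)$ still has a smaller mean squared error than the ordinary kernel estimator. This shows that the new estimator is robust to the choice of bandwidth. In addition, under some regularity conditions we obtain the asymptotic normality of the new estimator. Hence the two-step (three-step) estimator proposed in Chapter 2, which combines nonparametric and parametric regression, improves nonparametric estimation in terms of both bandwidth selection and convergence rate. More generally, the method extends to general nonparametric estimation and nonparametric regression models; for example, we extend it to multivariate nonparametric regression models and additive models.

Motivated by the covariate-adjusted regression (CAR) of Şentürk and Müller (2005) and by a practical problem (in studying calcium deficiency, one needs the relationship between calcium absorption and calcium intake while accounting for body mass index and age), Chapter 3 introduces and studies covariate-adjusted partially linear models (CAPLM). In these models the true response $Y$ and predictor vector $X$ are unobservable; we observe only versions of them contaminated by the multiplicative factors $\psi(U)$ and $\varphi_r(U)$, and the effect of time $T$ is also taken into account. Although our model looks like a special case of the covariate-adjusted varying coefficient model (CAVCM) of Şentürk (2006), the types of data handled by CAPLM and CAVCM are essentially different. Having observations from several individuals at a fixed observation time is the key to the first-step estimation method of Şentürk (2006), whereas in our data there may be only one observation at a fixed time. The inference methods for the two models therefore differ. As Cui et al. (2008) pointed out, [equation missing in source]; from this we obtain nonparametric estimators of $\psi(U)$ and $\varphi_r(U)$ and approximately recover the true, unobservable $Y$ and $X$. Then, replacing the unobservable true data with the recovered data, we estimate the parameter $\beta$ by profile least squares. Under some mild conditions we also obtain the asymptotic normality of the parameter estimator; see Section 3.3 for details. We also give confidence regions for the regression coefficients.

With advances in technology, it has become easier to acquire and store high-dimensional data sets, that is, data sets in which the number of variables $p$ is comparable to or much larger than the sample size $n$. Variable selection plays a crucial role in high-dimensional data analysis, and the Dantzig selector is one variable selection method for linear and generalized linear models. Chapter 4 studies variable selection for partially linear models via the Dantzig selector, defined as [equation missing in source], where $X$ and $Y$ denote the centered design matrix and the centered response observations. We obtain the large-sample properties of the Dantzig selector: as $n$ tends to infinity with $p$ fixed, under suitable conditions $\hat\beta$ converges to $\beta_0$, where $\beta_0$ solves an optimization problem [equation missing in source]. We also note that the Dantzig selector is not necessarily consistent. To overcome this shortcoming, we adopt the adaptive Dantzig selector proposed by Dicker and Lin (manuscript). The adaptive Dantzig selector for partially linear models is defined as [equation missing in source]. We then show that, under suitable conditions, the adaptive Dantzig selector estimator for partially linear models has the oracle property: as $n$ tends to infinity with $p$ fixed, the estimator is consistent for model selection and [equation missing in source]. Since the adaptive Dantzig selector is a general form of the Dantzig selector, both optimization problems can be solved with the DASSO algorithm proposed by James et al. (2009). The thesis also discusses the choice of the tuning parameter and the bandwidth.

In summary, this thesis further studies problems in nonparametric and semiparametric regression models. First, for nonparametric regression we propose a robust bias-corrected estimator; the new two-step (three-step) estimator is robust to bandwidth selection and, without a higher-order kernel, converges faster than the ordinary kernel estimator, with mean squared error of order $O(n^{-8/9})$. Second, we study covariate-adjusted partially linear models, give inference methods for them, and obtain the asymptotic normality of, and confidence regions for, the estimator of the parametric component. Finally, we study variable selection and parameter estimation for high-dimensional partially linear models: as the sample size $n$ tends to infinity with the number of variables $p$ fixed, we study the large-sample properties of the Dantzig selector estimator and obtain the oracle property of the adaptive Dantzig selector estimator. Simulation experiments and applications to real data further illustrate the methods introduced in the thesis.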

【Abstract】 Nonparametric and semiparametric regression models are well developed and widely used because of their flexibility and/or interpretability in identifying the regression structure between a response variable and predictor variables. Among semiparametric models, the partially linear model is a commonly used class that is flexible yet readily interpretable. It allows easier interpretation of the effect of each variable and may be preferred to a completely nonparametric regression because of the well-known "curse of dimensionality". Recently, covariate-adjusted models and variable selection have become popular topics in the analysis of real medical data and have received much attention. However, common kernel methods are sensitive to the bandwidth and cannot achieve a satisfactory convergence rate in the nonparametric regression setting; estimation for covariate-adjusted partially linear models has not been studied; and, as noted in Fan and Li (2004), limited work has been done on variable selection for partially linear models. This thesis focuses on these problems, all of which concern nonparametric and semiparametric regression models. More specifically, the motivation and basic ideas of the thesis are as follows.

It has been shown that the common kernel estimator of a nonparametric regression function can be approximately expressed as [equation missing in source]. From this representation we find a new regression relationship: $r(x)$ can be regarded as the intercept in the regression of $\hat r_{h_j}(x)$ on $h_j$, so we can build a linear regression model and then estimate $r(x)$ by weighted least squares. The newly proposed estimator has a simple structure and achieves a smaller mean squared error without a higher-order kernel. We obtain [equation missing in source], and the optimal bandwidth is of order $O(n^{-1/9})$.
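The idea of treating $r(x)$ as a regression intercept can be illustrated with a small sketch. It is not the thesis's estimator: it assumes that the leading bias of the kernel estimate is proportional to $h^2$ (the usual second-order-kernel expansion, which is consistent with the stated $O(n^{-1/9})$ bandwidth order), it uses ordinary rather than weighted least squares, and the function names, the Gaussian kernel, the toy data and the bandwidth grid are all hypothetical choices made for illustration.

```python
import math

def nw_estimate(x0, xs, ys, h):
    """Nadaraya-Watson estimate of r(x0) with a Gaussian kernel and bandwidth h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def extrapolated_estimate(x0, xs, ys, bandwidths):
    """Regress the kernel estimates on h^2 and return the fitted intercept."""
    t = [h * h for h in bandwidths]                        # assumed regressor: h_j^2
    r = [nw_estimate(x0, xs, ys, h) for h in bandwidths]   # responses: r_hat_{h_j}(x0)
    t_bar = sum(t) / len(t)
    r_bar = sum(r) / len(r)
    slope = (sum((ti - t_bar) * (ri - r_bar) for ti, ri in zip(t, r))
             / sum((ti - t_bar) ** 2 for ti in t))
    return r_bar - slope * t_bar                           # intercept estimates r(x0)

# Noise-free toy data, so that only the smoothing bias is visible.
xs = [i / 2000 for i in range(2001)]
ys = [math.sin(2 * math.pi * x) for x in xs]
x0, bandwidths = 0.25, [0.03, 0.04, 0.05, 0.06]

plain = nw_estimate(x0, xs, ys, bandwidths[0])             # single-bandwidth estimate
combined = extrapolated_estimate(x0, xs, ys, bandwidths)   # intercept of the h^2 fit
print(abs(plain - 1.0), abs(combined - 1.0))               # true value is r(0.25) = 1
```

Because the data here are noise-free, the comparison only shows the bias side of the trade-off; with noisy data the extrapolation also inflates variance, which is where the thesis's weighting and bandwidth conditions would matter.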
Further, we find that if the bandwidths $h_j$ are not optimally selected but satisfy the mild condition $h_j = O(n^{-\alpha})$ with $1/10 < \alpha < 1/5$, the new estimator $\hat r(x)$ still has a smaller mean squared error than the original kernel estimator. This means that the new estimator is robust to the bandwidth. Besides, under some mild conditions we obtain the asymptotic normality of the new estimator: [equation missing in source]. Thus the two-stage (or three-stage) regression estimation proposed in Chapter 2, which combines nonparametric regression with parametric regression, improves nonparametric estimation in terms of both bandwidth selection and convergence rate. More generally, the new method is also suitable for general nonparametric regression models regardless of the dimension of the explanatory variable and the structural assumptions on the regression function; for example, it is extended to the estimation of multivariate nonparametric regression models and additive models.

Motivated by the covariate-adjusted regression (CAR) proposed by Şentürk and Müller (2005) and by an application, namely investigating the relationship between calcium absorption and calcium intake in the study of calcium deficiency while accounting for body mass index and age, in Chapter 3 we introduce and investigate a covariate-adjusted partially linear regression model (CAPLM) [model equation missing in source]. In this model both the response $Y$ and the predictor vector $X$ can be observed only after being distorted by multiplicative factors $\psi(U)$ and $\varphi_r(U)$ respectively, and an additional variable $T$, such as age or period, is taken into account. Although our model seems to be a special case of the covariate-adjusted varying coefficient model (CAVCM) given by Şentürk (2006), the data types handled by CAPLM and CAVCM are fundamentally different.
Having measurements from several subjects at a fixed time is what enables the application of CAR in the first step of the estimation procedure of Şentürk (2006), whereas the data we consider may contain only one observation at a fixed time. The inference methods for the two models therefore differ. As shown by Cui et al. (2008), we have [equation missing in source]. As a result, we can construct nonparametric estimators of $\psi(U)$ and $\varphi_r(U)$, and the true unobserved values of $Y$ and $X$ can then be approximately recovered. Consequently, by replacing the true data with the recovered data, $\beta$ can be estimated by the profile least squares method. Furthermore, under some mild conditions, the asymptotic normality of the estimator of the parametric component is obtained; details are given in Section 3.3. Combined with a consistent estimate of the asymptotic covariance, this yields confidence intervals for the regression coefficients.

With the development of technology, it has become easy to obtain and store high-dimensional data sets in which the number of variables $p$ is comparable to or much larger than the sample size $n$. Variable selection plays an important role in high-dimensional data analysis, and the Dantzig selector is one method that performs variable selection and model fitting for linear and generalized linear models. In Chapter 4 we focus on variable selection for partially linear models via the Dantzig selector, which is defined as [equation missing in source], where $X$ and $Y$ are the centered design matrix and the centered response observations respectively. The large-sample asymptotic properties of the Dantzig selector estimator are studied. When $n$ tends to infinity while $p$ is fixed, under some appropriate conditions, $\hat\beta$ converges to $\beta_0$, where $\beta_0$ solves [equation missing in source]. We see that the Dantzig selector might not be consistent.
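The recover-then-profile procedure described above can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the source: the distorting covariate $U$ is independent of $(X, Y)$, each distortion function has mean one, and $X$ and $Y$ have nonzero means, so that the smoothed distorted variable divided by its sample mean estimates the distortion. The scalar predictor, the particular distortion functions, the Gaussian kernel, the bandwidth, and all function names are hypothetical.

```python
import math
import random

def nw_smooth(u0, u, v, h):
    """Nadaraya-Watson estimate of E[v | u = u0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((u0 - ui) / h) ** 2) for ui in u]
    return sum(w * vi for w, vi in zip(weights, v)) / sum(weights)

def recover(distorted, u, h):
    """Divide each value by psi_hat(U) = E_hat[distorted | U] / mean(distorted)."""
    mean_d = sum(distorted) / len(distorted)
    return [d * mean_d / nw_smooth(ui, u, distorted, h)
            for d, ui in zip(distorted, u)]

def profile_ls_slope(y, x, t, h):
    """Profile least squares for y = beta * x + g(t) + eps with a scalar x."""
    ry = [yi - nw_smooth(ti, t, y, h) for yi, ti in zip(y, t)]  # y minus E_hat[y | t]
    rx = [xi - nw_smooth(ti, t, x, h) for xi, ti in zip(x, t)]  # x minus E_hat[x | t]
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)

def demo(seed=0, n=300, beta=1.5, h=0.1):
    """Simulate distorted data; return (slope from distorted data, slope after recovery)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    t = [rng.random() for _ in range(n)]
    x = [2.0 + rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + 3.0 + math.sin(2 * math.pi * ti) + rng.gauss(0.0, 0.2)
         for xi, ti in zip(x, t)]
    y_obs = [(0.5 + ui) * yi for ui, yi in zip(u, y)]   # assumed psi(U) = 0.5 + U, mean 1
    x_obs = [(1.5 - ui) * xi for ui, xi in zip(u, x)]   # assumed phi(U) = 1.5 - U, mean 1
    naive = profile_ls_slope(y_obs, x_obs, t, h)
    adjusted = profile_ls_slope(recover(y_obs, u, h), recover(x_obs, u, h), t, h)
    return naive, adjusted

naive, adjusted = demo()
print(naive, adjusted)  # the true slope in this simulation is 1.5
```

In this toy setting, fitting the distorted data directly gives a slope far from the true value, while the slope fitted to the recovered data should land close to it; the thesis's actual estimator, conditions, and multivariate treatment are not reproduced here.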
To remedy this drawback, we adopt the adaptive Dantzig selector of Dicker and Lin (manuscript), defined as [equation missing in source]. Moreover, we show that the adaptive Dantzig selector estimator for the parametric component of partially linear models also has the oracle property under some appropriate conditions: assuming all the regularity conditions hold, when $n$ tends to infinity and $p$ is fixed, the adaptive Dantzig selector estimator $\hat\beta$ is consistent for model selection and [equation missing in source]. Since the adaptive Dantzig selector generalizes the Dantzig selector, both optimization problems can be solved with the efficient DASSO algorithm proposed by James et al. (2009). Choices of the tuning parameter and the bandwidth are also discussed.

In summary, we further study nonparametric and semiparametric regression models. First, we propose a robust, bias-corrected estimator in the nonparametric regression setting. The new two-stage (or three-stage) estimator has mean squared error of order $O(n^{-8/9})$ and is robust to bandwidth selection. Second, we investigate covariate-adjusted partially linear models and, under some mild conditions, obtain the asymptotic normality of the estimator of the parametric component and confidence intervals for it. Finally, we explore variable selection and parameter estimation for partially linear models. When the sample size $n$ tends to infinity and the number of predictor variables $p$ is fixed, the large-sample asymptotic properties of the Dantzig selector estimator are studied and the oracle property of the adaptive Dantzig selector estimator is obtained under some appropriate conditions.

Simulations and real data analyses are also carried out to illustrate the new methods.
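For orientation, the two selectors referred to above are usually written in the following form. This block is a sketch of the standard definitions from the literature, not the thesis's own displayed equations, which are missing from this record. The tildes marking the centered design matrix and response, the normalization of the constraint, and the weights $w_j$ (typically built from an initial estimate) are assumptions.

```latex
% Dantzig selector, written here for centered data \tilde{X}, \tilde{Y}
% (tilde notation and the absence of a 1/n factor are assumptions)
\hat\beta^{\mathrm{DS}}
  = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1
  \quad \text{subject to} \quad
  \bigl\| \tilde{X}^{\top} (\tilde{Y} - \tilde{X}\beta) \bigr\|_{\infty} \le \lambda .

% Adaptive Dantzig selector with nonnegative weights w_j (assumed form)
\hat\beta^{\mathrm{ADS}}
  = \arg\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{p} w_j \lvert \beta_j \rvert
  \quad \text{subject to} \quad
  \bigl\lvert \tilde{X}_j^{\top} (\tilde{Y} - \tilde{X}\beta) \bigr\rvert \le \lambda w_j ,
  \qquad j = 1, \dots, p .
```

Setting every $w_j = 1$ in the second problem gives back the first, which matches the abstract's remark that the adaptive version generalizes the Dantzig selector.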

  • 【Online publication contributor】 Shandong University
  • 【Online publication issue】 2010, No. 08