
Variable Selection Methods in Statistical Models for Survival Data

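【Author】 Liu Jicai (刘吉彩)

【Advisor】 Zhang Riquan (张日权)

【Author information】 East China Normal University, Probability Theory and Mathematical Statistics, 2014, Ph.D.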

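【Abstract (translated from the Chinese)】 Survival data arise widely in biomedicine, economics and finance, actuarial science, reliability engineering, and other fields. Because survival data are generally censored, almost all statistical methods designed for complete data break down, so how to analyze such data has remained an active topic. Moreover, in many practical problems several different survival times are observed; such data are called multivariate survival time data. Their main feature is that the survival times may be dependent. This complex dependence, together with censoring, makes the statistical analysis of multivariate survival time data difficult, yet its broad practical value has attracted the attention of more and more researchers. With the development of modern technology, massive data sets are everywhere, especially in bioinformatics, aerospace, artificial intelligence, and e-commerce. Such data are typically very high dimensional and very noisy, and how to extract useful information from them is the question of greatest concern. Variable selection, as an important tool for extracting information, has received a great deal of attention from statisticians. Classical variable selection methods, however, may fail completely on such high-dimensional data, and statisticians have proposed various improvements. The most popular are regularization methods such as LASSO, SCAD, and MCP. This dissertation studies three problems concerning regularized variable selection for survival data, including multivariate survival time data: first, the selection of structured covariates; second, variable selection in the ultrahigh-dimensional setting, that is, p ≫ n; third, variable selection in semiparametric regression models. Chapter 2 discusses variable selection under the additive hazards model when the covariates have a group structure. The goal is to identify the important groups and the important variables within groups simultaneously, and to this end we consider a hierarchical penalty method. We prove the large-sample properties of the proposed estimator when the covariate dimension diverges. Numerical results show that, when the covariates have a group structure, the method outperforms existing methods such as LASSO, SCAD, and Adaptive LASSO. Finally, we use the proposed method to analyze a gene data set. Chapter 3 studies the large-sample properties of a class of nonconvex penalty methods for the additive hazards model when the covariate dimension satisfies p = O(exp(n^δ)) with δ > 0. Under a condition similar to the Irrepresentable Condition of Zhao and Yu [97], we prove that the proposed estimator has the strong oracle property. Interestingly, this property also holds for LASSO. We also establish the asymptotic normality of the nonconvex penalized estimator (which in this case does not include LASSO). Chapters 4 and 5, working with multivariate survival time data, consider variable selection for the partially varying-coefficient and the partially linear proportional hazards regression models, respectively. To select and estimate the covariates in the parametric part, we mainly adopt the idea of one-step backfitting estimation; the important nonparametric components are identified mainly through hypothesis testing. Under some regularity conditions we obtain the oracle properties of the corresponding estimators. Simulation results confirm that the proposed methods select variables well. Finally, we apply the methods to the statistical analysis of colon cancer data.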

【Abstract】 Survival data arise widely in biomedicine, economics and finance, actuarial science, reliability engineering, and other fields. Because of censoring, however, classical statistical methods for complete data are not suitable for analyzing survival data, and how to make inferences about such data remains an active research topic. Moreover, multivariate survival time data arise frequently in biomedical studies when more than one failure outcome is observed for an individual. A key feature of this type of data is that the survival times for the same subject or cluster may be correlated. Because of this complex dependence and the censoring, inference becomes nontrivial. Nevertheless, owing to its wide use in practice, the statistical analysis of multivariate survival time data has attracted increasing attention.
With the development of modern technology, massive data sets are encountered in many fields, especially bioinformatics, aerospace, artificial intelligence, and electronic commerce. Such data are typically very high dimensional and noisy, and how to extract useful information from them is a fundamental problem. As an efficient tool for extracting important information, variable selection has received great attention from statisticians. Classical variable selection methods, however, are often infeasible for such high-dimensional data, and many improved methods have been proposed. Among them, the most popular are regularization methods such as LASSO, SCAD, and MCP.
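For orientation, the three penalties named above can be written as short functions. The sketch below is illustrative only and is not taken from the dissertation: it uses the standard textbook forms of the penalties, and the function names and the default tuning values `a = 3.7` and `gamma = 3.0` are assumptions chosen for the example.

```python
def lasso_penalty(beta: float, lam: float) -> float:
    """LASSO (L1) penalty: grows linearly in |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty (requires a > 2): L1 near zero, quadratic
    transition on (lam, a*lam], constant beyond a*lam."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta: float, lam: float, gamma: float = 3.0) -> float:
    """MCP penalty (requires gamma > 1): concave up to gamma*lam,
    constant beyond it."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2


if __name__ == "__main__":
    # Compare the three penalties on a small and a large coefficient.
    for beta in (0.5, 5.0):
        print(beta,
              lasso_penalty(beta, 1.0),
              scad_penalty(beta, 1.0),
              mcp_penalty(beta, 1.0))
```

All three penalties have slope `lam` at zero, but SCAD and MCP level off for large coefficients, whereas the LASSO penalty keeps growing. That flattening is what reduces the shrinkage bias on large effects.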
In the framework of survival data, including multivariate survival time data, this dissertation addresses three questions about regularization methods: first, how to select important variables when the covariates have a group structure; second, how to carry out variable selection when the dimension p ≫ n, where n is the sample size; third, how to identify important variables in a semiparametric regression model.
In Chapter 2, we discuss the variable selection problem in the additive hazards model when the covariates are grouped. The aim is to identify the important groups and the important variables within groups simultaneously. To this end, we consider a hierarchical penalty method. For the case of diverging dimension, we establish the large-sample properties of the proposed method. Numerical results indicate that, when the covariates have a group structure, the hierarchically penalized method outperforms existing methods such as LASSO, SCAD, and Adaptive LASSO. Finally, we analyze a gene expression data set with the proposed method.
In Chapter 3, we consider the large-sample properties of a class of nonconcave penalized procedures in the additive hazards model when the dimension of the covariates may grow nonpolynomially with the sample size n, namely p = O(exp(n^δ)) with δ > 0. Under a condition similar to the Irrepresentable Condition proposed by Zhao and Yu [97], we prove that the proposed estimator has the strong oracle property. Interestingly, this property also holds for the LASSO. In addition, we establish the asymptotic normality of the nonconcave penalized estimator; this result does not cover the LASSO penalty.
In Chapters 4 and 5, we study variable selection in the partially linear varying-coefficient marginal hazards model and the partially linear marginal hazards model, respectively, for multivariate survival time data. For the parametric parts, we mainly use the idea of the one-step backfitting method, and the important nonparametric functions are identified through hypothesis testing. Under some regularity conditions, we obtain the oracle properties of the corresponding estimators. The simulation results demonstrate that the proposed methods perform well. Finally, we apply these methods to the analysis of colon cancer data.
