Comparison and Simulation Study of Missing-Data Imputation Methods in Epidemiological Surveys of Cardiovascular Disease

【Author】 解东方 (Xie Dongfang)

【Supervisor】 李卫 (Li Wei)

【Author information】 Peking Union Medical College, Epidemiology and Health Statistics, 2014, doctoral thesis

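【Abstract, translated from the Chinese】

Objective: Cardiovascular disease seriously endangers human health worldwide, and recent studies show that its incidence and mortality are rising in developing countries. Many large-scale epidemiological surveys of this class of chronic disease have been launched, providing new leads and large-sample evidence for the prevention of cardiovascular disease. However, because of people's social and psychological characteristics, research data are often incomplete, that is, they contain missing data. When the proportion of missing data fell within a certain range, the usual practice was to delete the affected records. This is simple, but it reduces the observed sample size and therefore the statistical power of the analysis. In recent years imputation methods have gained increasing acceptance among experts and scholars, and new methods have developed rapidly. This study handles missing data with single imputation and multiple imputation, focusing on the differences among multiple imputation methods, with the aim of finding imputation strategies and methods suited to routine epidemiological surveys of chronic disease.

Methods: Starting from a large-sample, multivariable data set from the cardiovascular field, Monte Carlo techniques were used to simulate missingness under a missing completely at random (MCAR) mechanism at four missing proportions (5%, 10%, 20%, and 30%). The scenarios covered a single missing variable of each type (continuous, binary, ordinal, and nominal), two missing variables with a monotone missing pattern, and two missing variables with an arbitrary missing pattern. Each scenario was simulated 500 times. In each simulation the incomplete data set was processed by single imputation, by the joint modeling (JM) multiple imputation strategy, and by the fully conditional specification (FCS) multiple imputation strategy. The performance indicators of each method were then collected from every simulation and summarized in order to compare the methods.

Results: For a single missing variable, JM multiple imputation of a continuous variable gave the overall mean closest to that of the complete data set, and JM multiple imputation of a nominal variable gave the highest rate of correctly imputed individual values. FCS multiple imputation, however, imputed the individual missing values of a continuous variable more accurately and produced smaller bias in the parameters of the post-imputation model; it was also more accurate for the individual missing values of a binary variable. For a single missing categorical variable, discriminant analysis imputation had a higher correct rate than logistic regression imputation. As for the number of imputations, 15 imputations performed best for a single missing continuous variable, although the gain beyond 10 was limited; 5 imputations performed best for a single missing binary or nominal variable. For multivariable missingness with a monotone pattern, JM multiple imputation imputed individual missing values more accurately than FCS multiple imputation. When a continuous variable was monotone missing together with a binary, ordinal, or nominal variable, FCS multiple imputation imputed the individual values of the continuous variable more accurately than JM multiple imputation, whereas JM multiple imputation had a higher correct rate for the categorical variable. For multivariable missingness with an arbitrary pattern, when a continuous and a nominal variable were missing, predictive mean matching (regpmm) combined with the discriminant function method (discrim) imputed the individual values of the continuous variable more accurately and also achieved a fairly high correct rate for the nominal variable. Taking the four missing proportions together, FCS (regpmm+discrim) with 5 imputations performed best overall.

Conclusions: Starting from a large complete data set from cardiovascular research, this study used simulated missingness to construct missing-data scenarios for different types of variables. For a single missing variable, JM multiple imputation is suitable for nominal variables, while FCS multiple imputation is suitable for binary and continuous variables. For several continuous variables with a monotone missing pattern, JM multiple imputation is more accurate. When both continuous and discrete variables are missing, JM multiple imputation is suitable for the continuous variables and FCS multiple imputation for the discrete variables. For multivariable missingness with an arbitrary pattern, FCS multiple imputation is more accurate.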
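The simulation design described in the Methods (delete values completely at random at several rates, impute, and score the result over repeated runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the data are synthetic, the variable `sbp` and the helper names `make_mcar`, `mean_impute`, and `simulate` are hypothetical, the only imputation method shown is single mean imputation, and the two scores (bias of the overall mean, RMSE of individual imputed values) are stand-ins for the evaluation indicators, which the abstract does not list.

```python
import numpy as np

def make_mcar(values, rate, rng):
    """Copy `values` and set a fraction `rate` of entries to NaN,
    chosen completely at random (MCAR)."""
    values = np.asarray(values, dtype=float)
    n_missing = int(round(rate * values.size))
    idx = rng.choice(values.size, size=n_missing, replace=False)
    out = values.copy()
    out[idx] = np.nan
    return out

def mean_impute(values):
    """Single imputation: fill every NaN with the mean of the observed values."""
    out = values.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

def simulate(values, rates=(0.05, 0.10, 0.20, 0.30), n_rep=500, seed=0):
    """For each missing rate, repeat delete -> impute -> score `n_rep` times.
    Returns {rate: (average bias of the overall mean, average individual RMSE)}."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    true_mean = values.mean()
    results = {}
    for rate in rates:
        bias = np.empty(n_rep)
        rmse = np.empty(n_rep)
        for r in range(n_rep):
            incomplete = make_mcar(values, rate, rng)
            mask = np.isnan(incomplete)
            filled = mean_impute(incomplete)
            bias[r] = filled.mean() - true_mean
            rmse[r] = np.sqrt(np.mean((filled[mask] - values[mask]) ** 2))
        results[rate] = (float(bias.mean()), float(rmse.mean()))
    return results

# Synthetic stand-in for a continuous survey variable (hypothetical values).
data_rng = np.random.default_rng(42)
sbp = data_rng.normal(130.0, 18.0, size=2000)

summary = simulate(sbp, n_rep=50)  # 50 repetitions keeps the demo fast
for rate, (b, e) in summary.items():
    print(f"{rate:.0%} missing: mean bias {b:+.3f}, individual RMSE {e:.2f}")
```

Swapping `mean_impute` for another imputation function with the same signature would let the same loop compare methods, which is the structure the Methods paragraph describes.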

【Abstract】

Objective: Cardiovascular disease is a serious threat to human health worldwide, and recent studies have shown that its incidence and mortality are increasing in developing countries. Many large-scale epidemiological studies of this chronic disease have been carried out, providing new clues and large-sample evidence for the prevention of cardiovascular disease. However, owing to the social and psychological characteristics of people, research data are often incomplete, that is, they contain missing data. When the proportion of missing data was within a certain range, the past approach was to delete the data directly. Although simple, this reduces the number of observations and lowers the statistical power of the analysis. In recent years imputation methods have been accepted by more and more experts and have developed rapidly. In this study, single and multiple imputation methods are applied to missing data, with a focus on the differences among multiple imputation methods, in order to find appropriate methods and strategies for epidemiological studies of chronic disease.

Methods: Based on a large-sample, multivariate cardiovascular disease data set, we used Monte Carlo techniques to simulate data missing completely at random at missing proportions of 5%, 10%, 20%, and 30%. The scenarios were a single missing variable of each type (continuous, binary, ordinal, and nominal), two variables with a monotone missing pattern, and two variables with an arbitrary missing pattern. Each scenario was simulated 500 times. In each simulation the incomplete data set was processed with deletion, single imputation, joint modeling (JM) multiple imputation, and fully conditional specification (FCS) multiple imputation. The evaluation measures of each method were then collected from every simulation and the methods' performance was compared.

Results: For a single missing variable, JM multiple imputation gave the overall mean closest to that of the complete data set when the variable was continuous, and the monotone joint modeling imputation method gave the highest correct rate for individual missing values when the variable was nominal. FCS multiple imputation, however, gave greater accuracy and smaller parameter bias for a single missing continuous variable, and greater accuracy for a single missing binary variable. For a single missing categorical variable, discriminant analysis performed better than logistic regression imputation. As for the number of imputations, 15 imputations were best for a single missing continuous variable, although the improvement beyond 10 was limited; 5 imputations were best for a single missing binary or nominal variable. For monotone multivariate missingness, JM multiple imputation was better than FCS multiple imputation. When a continuous variable was missing together with a binary, ordinal, or nominal variable, FCS multiple imputation was more accurate than JM multiple imputation for the continuous variable, but JM multiple imputation had a higher correct rate for the categorical variable. For multivariate missingness with an arbitrary pattern, when continuous and nominal variables were missing, regpmm combined with discrim achieved high accuracy for both the continuous and the nominal variable. Across the four missing proportions, FCS (regpmm+discrim) with 5 imputations was best.

Conclusion: In this study we used simulation to construct missingness in different types of variables. For a single missing variable, JM multiple imputation was suitable for nominal variables, and FCS multiple imputation was suitable for binary and continuous variables. For several continuous variables with monotone missingness, monotone joint modeling imputation achieved higher accuracy. When both continuous and discrete variables were missing, JM multiple imputation was suitable for the continuous variables and FCS multiple imputation for the discrete variables. For multivariate missingness with an arbitrary pattern, FCS multiple imputation achieved higher precision.
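The abstract names predictive mean matching (regpmm) as the continuous-variable method in the best-performing FCS combination. As a rough illustration of the idea only, the sketch below fits a linear regression on the observed cases, then fills each missing value with the observed value of a donor whose predicted mean is close. It is a simplified version under several assumptions: the function `pmm_impute`, the donor pool size `k=5`, and the synthetic `age` and `sbp` variables are all hypothetical, and the regression coefficients are not redrawn for each imputation as a proper multiple imputation procedure would do. It is not the implementation used in the thesis.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Simplified predictive mean matching for one continuous variable.
    y: 1-D array with NaN for missing; X: fully observed predictor(s)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    miss = np.isnan(y)
    obs = ~miss
    # Least-squares fit on the observed cases only.
    beta, *_ = np.linalg.lstsq(design[obs], y[obs], rcond=None)
    pred = design @ beta
    pred_obs, y_obs = pred[obs], y[obs]
    k = min(k, y_obs.size)
    out = y.copy()
    for i in np.flatnonzero(miss):
        # The k observed cases whose predicted means are closest to case i.
        donors = np.argpartition(np.abs(pred_obs - pred[i]), k - 1)[:k]
        # Borrow the *observed* value of one randomly chosen donor.
        out[i] = y_obs[rng.choice(donors)]
    return out

# Synthetic example (hypothetical variables and coefficients).
rng = np.random.default_rng(1)
age = rng.uniform(35, 70, size=500)
sbp = 90 + 0.7 * age + rng.normal(0, 10, size=500)
sbp_missing = sbp.copy()
sbp_missing[rng.choice(500, size=100, replace=False)] = np.nan  # 20% MCAR

# Five imputed data sets; the overall mean is pooled by simple averaging.
imputations = [pmm_impute(sbp_missing, age, k=5, rng=rng) for _ in range(5)]
pooled_mean = float(np.mean([imp.mean() for imp in imputations]))
print(f"complete-data mean {sbp.mean():.2f}, pooled imputed mean {pooled_mean:.2f}")
```

Because every imputed value is copied from an observed case, the filled-in values always stay within the range of real measurements, which is one commonly cited reason for preferring matching over plugging in the regression prediction directly.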

  • 【Classification No.】R54
  • 【Times cited】1
  • 【Downloads】122