

Statistical Methods and Applications of Nine RRT Models from a (Stratified) Three-stage Sampling Design Dealing with Sensitive Questions

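【摘要 (Abstract)】 Objective: Sampling surveys are a very important research method in medical science. In practice, when the characteristic or variable of interest is a sensitive question, one that is highly private or difficult to state or take a position on in public, traditional survey methods such as direct questioning or observation lead some respondents, out of a need for self-protection, to refuse to answer or to answer falsely on purpose. This produces nonresponse bias or lying bias, so the results fail to reflect the true state and characteristics of the population. In 1965 the American statistician Warner used a randomizing device to obtain statistical data on a binary sensitive question while effectively protecting respondents' privacy, pioneering the randomized response technique (RRT). Since Warner, statisticians in China and abroad have pursued the idea of randomization and, over more than forty years, have proposed a series of improvements to the Warner model as well as some new survey methods, driving the rapid development of sampling surveys on sensitive questions.

Before our group's work, however, research had concentrated on binary and quantitative sensitive questions and paid little attention to multi-category sensitive questions. The statistical methods studied were mainly limited to simple random sampling, and applications were mainly limited to small-sample simple random sampling surveys of special populations within small areas, or, in large-scale surveys, to data collected by complex sampling being analyzed incorrectly with simple random sampling formulae. In survey-based research the sampling design is an important component, and sample size estimation is its key step, yet there are few reports on how to estimate the sample size at each stage for randomized response models under complex sampling methods.

This thesis therefore considers 18 survey methods formed by combining 9 randomized response models with 2 sampling methods commonly used in practice, three-stage sampling and stratified three-stage sampling. Having given the formulae for the relevant statistics of the sensitive characteristics, we derive optimal sample size formulae for estimating the population proportion and population mean in two situations: minimizing survey cost for a fixed sampling error, and minimizing sampling error for a fixed survey cost. These provide scientific sample size formulae for complex sampling methods suited to larger-scale surveys of all types of sensitive questions. A preliminary survey of sensitive characteristics among female sex workers (FSW) in Xichang provides preliminary data on the epidemic status of sexually transmitted diseases (STD) and AIDS in Xichang and yields numerical estimates of the statistics required by the sample size formulae. For the three-stage sampling survey of sensitive characteristics among FSW in Xichang that our team plans to conduct in 2015 under a National Natural Science Foundation of China project (No. 81273188), the sample size at each stage was estimated with the derived formulae, completing the survey design for that project. The work provides statistical methods for scientifically estimating sensitive characteristics of populations with high-risk behaviors for AIDS, and a scientific basis for health authorities to formulate strategies, plans and measures for preventing and controlling STD and AIDS.

For 6 survey methods combining 3 RRT models with three-stage and stratified three-stage sampling, the statistics from the actual Xichang survey were used as simulated population parameters. A simulated population was built in SAS and 100 simulated preliminary sampling surveys were carried out. The derived sample size formulae were used to estimate the optimal sample sizes needed for the simulated formal survey, and 100 simulated formal survey samples were drawn accordingly. Point and interval estimates of the population parameters were computed for these 100 samples with the derived formulae and compared with the simulated population parameters, in order to evaluate the reliability and validity of the survey methods, their statistics, and the optimal sample size formulae.

Methods:
1. Nine randomized response models were considered. For binary sensitive questions: the Warner model, the Simmons model, the two-unrelated-question model and the improved randomized response model. For multiple-choice sensitive questions: the single-sample randomized response model and the randomized indirect response model. For quantitative sensitive questions: the unrelated-question model, the additive model and the multiplicative model. Combining these with three-stage sampling and stratified three-stage sampling gives 18 survey methods. For each, formulae for the estimators of the population proportion and population mean, their variances, and their estimated variances were given on the basis of Cochran's sampling theory, the law of total probability, the basic properties of means and variances, and other results from probability and mathematical statistics.
2. For each of the 18 survey methods, in the two situations of minimizing survey cost for a fixed sampling error and minimizing sampling error for a fixed survey cost, the Cauchy inequality, constrained minimization and other methods from advanced calculus and algebra were used to derive the optimal sample size formulae for each stratum and each stage when estimating the population proportion and population mean of sensitive characteristics.
3. RRT randomizing devices were designed and a questionnaire on sensitive questions was drawn up. Under three-stage sampling, using the Simmons model for binary sensitive questions, the randomized response model for multiple-choice sensitive questions, and the additive model for quantitative sensitive questions, a preliminary survey of 10 sensitive questions was carried out among FSW in Xichang from May to July 2011. The data were analyzed with the formulae given in this thesis, and the values of the statistics required by the sample size formulae were estimated.
4. Using the derived sample size formulae for complex sampling and the values obtained from the preliminary survey, the optimal sample size at each stage of the three-stage sampling was computed for the 10 sensitive questions under the three models above, both for minimum cost at a fixed sampling error and for minimum sampling error at a fixed cost.
5. Taking the sample proportions, category-wise sample proportions, or sample means from the actual survey of FSW in Xichang as population parameters, a simulated population was built in SAS using the Monte Carlo method. The following steps were applied to each of the 6 survey methods formed by combining 3 RRT models (the Simmons model for binary sensitive questions, the single-sample randomized response model for multiple-choice sensitive questions, and the additive model for quantitative sensitive questions) with three-stage and stratified three-stage sampling. First, 100 simulated preliminary samples were drawn in SAS. Second, the derived optimal sample size formulae were used to compute 100 sets of sample sizes for each stratum and stage of the simulated formal survey. Third, 100 simulated formal samples were drawn in SAS with those 100 sets of sample sizes. Fourth, the estimators of the population proportion, category-wise population proportions, or population mean and their estimated variances were computed with the corresponding formulae. Finally, point estimates and 95% interval estimates of the population proportion or mean were made for each of the 100 formal samples to evaluate the accuracy (validity) and precision (reliability) of the survey methods, their statistics, and the optimal sample size formulae. If nearly all of the 100 95% confidence intervals contain the population proportion (or mean), the 100 sample proportions (or means) can be regarded as nearly all close to the population value, which indicates that the RRT survey methods under (stratified) three-stage sampling, their statistics, and the optimal sample size formulae have good validity. Because the 100 sample proportions (or means) are then all close to the same level, this also indicates good reliability.

Results:
1. This thesis designed 18 survey methods by combining 9 randomized response models with three-stage and stratified three-stage sampling and, for each, gave formulae for the estimators of the population proportion and population mean of sensitive questions, their variances, and their estimated variances.
2. For each of the 18 survey methods, the optimal sample size formulae for each stage were derived for estimating the population proportion and population mean, both for minimum cost at a fixed sampling error and for minimum sampling error at a fixed cost.
3. A preliminary three-stage sampling survey of 10 sensitive questions under 3 randomized response models was carried out among FSW in Xichang. The results were as follows. The mean age at first sex service was 21.45 years (standard error 0.8162 years). The mean number of sex services per person per month was 41.66 (standard error 1.4550). The mean fee per sex service was 213.67 yuan (standard error 8.2475 yuan). The proportion who, apart from paying clients, had a spouse or other steady sex partner was 55.94% (standard error 3.87%). The proportion who stopped providing sex services after being diagnosed with an STD by a doctor was 75.85% (standard error 3.00%). The proportion agreeing that prostitution should be legalized was 56.77% (standard error 4.12%). For STD testing in the most recent year, the proportions never tested, without STD, and with STD were 62.12%, 21.36% and 5.57% (standard errors 4.00%, 3.87% and 2.24%). For HIV testing in the most recent year, the proportions never tested, testing negative, and testing positive were 57.11%, 23.54% and 2.35% (standard errors 4.00%, 4.00% and 1.00%). The proportion whose condom broke during the most recent sex service was 8.27% (standard error 2.65%). For sex services in the most recent month, the proportions who never, sometimes, and always used a condom throughout were 11.40%, 14.21% and 74.40% (standard errors 2.65%, 3.16% and 4.69%).
4. For the three-stage sampling survey of sensitive characteristics among FSW in Xichang that our team plans to conduct in 2015 under the National Natural Science Foundation of China project (No. 81273188), the sample size required at each stage was estimated from the derived formulae, combining the preliminary results for the 10 sensitive questions. The number of districts to be randomly selected in the first stage is n1 = 5; the average number of venues to be randomly selected per district in the second stage is n2 = 6; and the average number of FSW to be randomly selected per venue per district in the third stage is n3 = 29.
5. For the 6 survey methods combining 3 RRT models with three-stage and stratified three-stage sampling, 100 computer-simulated preliminary surveys and 100 simulated formal surveys were carried out. The results of the simulated formal surveys were as follows.
5.1 For the Simmons model for binary sensitive questions under stratified three-stage sampling, simulating the proportion of FSW with a spouse or steady sex partner, 96 of the 100 95% confidence intervals for the population proportion, inferred from the sample statistics and sample variances, contained the population proportion, and every sample proportion was close to the population proportion (the simulated true value). This indicates that the survey method, statistical formulae and optimal sample size formulae for the Simmons model under (stratified) three-stage sampling have good reliability and validity.
5.2 For the single-sample RRT model for multi-category sensitive questions under stratified three-stage sampling, simulating condom use throughout sex services, the 95% confidence interval contained the category's population proportion in 97 of 100 simulations for category one (never used throughout), in 97 of 100 for category two (sometimes used throughout), and in 96 of 100 for category three (always used throughout). This indicates that the survey method, statistical formulae and optimal sample size formulae for this model under (stratified) three-stage sampling have good reliability and validity.
5.3 For the additive model for quantitative sensitive questions, 99 of the 100 95% confidence intervals for the population mean contained the population mean, and every sample mean was close to the population mean (the simulated true value). This indicates that the survey method, statistical formulae and optimal sample size formulae for the additive model under (stratified) three-stage sampling have good reliability and validity.

Conclusions:
1. For 18 survey methods combining 9 randomized response models with three-stage and stratified three-stage sampling, this thesis gave formulae for the estimators of the population proportion and population mean of sensitive characteristics and for their estimated variances. The three-stage sampling methods for three of the models were successfully applied in a preliminary survey of sensitive questions among FSW in Xichang, a population at high risk of STD and AIDS, with satisfactory practical results. This suggests that the survey methods and statistical formulae are scientific, reliable, effective, practical and widely applicable, with broad prospects and important value in application.
2. The preliminary survey suggests that FSW in Xichang show high-risk behaviors for STD and AIDS: a high number of sex services per month, low fees per service, a high proportion agreeing that prostitution should be legalized, a high proportion who have never been tested for STD or AIDS at a formal medical institution, and a high proportion of condom breakage during sex services. The situation for STD and AIDS prevention and control remains far from optimistic. It should receive close attention from the government and health departments, which should seek reasonable countermeasures and treat the prevention and control of STD and AIDS as an urgent, complex and long-term task, so as to create a safe health environment for the public.
3. For the same 18 survey methods, the optimal sample size formulae for each stratum and each stage were derived for estimating the population proportion and population mean, both for minimum cost at a fixed sampling error and for minimum sampling error at a fixed cost, providing a new scientific method for designing sampling surveys on sensitive questions. Using the three-stage sampling methods for 3 randomized response models and the derived sample size formulae, the optimal sample size at each stage was estimated for the survey of sensitive characteristics among FSW in Xichang, which is of positive significance for wider adoption and of broad value in application.
4. For the 6 survey methods combining 3 RRT models with three-stage and stratified three-stage sampling, 100 computer-simulated preliminary samples were first used to estimate the sample sizes, and 100 simulated formal samples were then drawn. In the simulated formal surveys nearly all of the 100 95% confidence intervals contained the population parameter, indicating that the survey methods, their statistics, and the optimal sample size formulae have good reliability and validity.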
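The coverage check described in the simulation steps above can be sketched in a few lines. This is a minimal illustration only, written in Python rather than the SAS used in the thesis, and it assumes simple random sampling instead of the (stratified) three-stage design. The estimator is the standard Simmons unrelated-question form. The design values `p` (probability of drawing the sensitive question), `pi_u` (known "yes" proportion of the unrelated question) and the sample size `n` are hypothetical; only the true proportion 0.5594 is taken from the preliminary survey result quoted in the abstract.

```python
import math
import random


def simmons_estimate(yes_count, n, p, pi_u):
    """Simmons unrelated-question estimator under simple random sampling.

    p    : probability that the device selects the sensitive question
    pi_u : known proportion answering "yes" to the unrelated question
    Returns (estimated sensitive proportion, estimated variance).
    """
    lam_hat = yes_count / n                       # observed share of "yes"
    pi_hat = (lam_hat - (1 - p) * pi_u) / p       # invert lam = p*pi + (1-p)*pi_u
    var_hat = lam_hat * (1 - lam_hat) / ((n - 1) * p ** 2)
    return pi_hat, var_hat


def simulate_survey(n, pi_s, p, pi_u, rng):
    """Simulate n randomized responses; return the number of "yes" answers."""
    yes = 0
    for _ in range(n):
        if rng.random() < p:          # device points to the sensitive question
            yes += rng.random() < pi_s
        else:                         # device points to the unrelated question
            yes += rng.random() < pi_u
    return yes


def coverage(reps, n, pi_s, p, pi_u, seed=2011):
    """Count how many of `reps` 95% confidence intervals contain pi_s."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        yes = simulate_survey(n, pi_s, p, pi_u, rng)
        pi_hat, var_hat = simmons_estimate(yes, n, p, pi_u)
        half_width = 1.96 * math.sqrt(var_hat)
        hits += pi_hat - half_width <= pi_s <= pi_hat + half_width
    return hits
```

Calling `coverage(100, 1000, 0.5594, 0.7, 0.5)` mirrors the 100-replicate check: a count near 95 is what a correctly specified estimator and variance formula should produce, which is the sense in which the abstract reads 96 to 99 covering intervals out of 100 as evidence of validity.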

【Abstract】 Objective: Sampling surveys are an essential method for scientific research in medicine and health, and surveys on sensitive subjects are almost inevitable. Obtaining truthful answers to survey questions about sensitive matters is a challenge. Sensitive topics are perceived as threatening privacy or as difficult to state in public, so direct questioning often leads to refusals or untruthful replies. Untruthful reporting reflects socially desirable or undesirable responding rather than an accurate response, and answers to sensitive questions are thereby distorted by nonresponse bias or lying bias. The randomized response technique (RRT) was first conceived by Warner in 1965 and introduced as a method for protecting respondents' privacy while improving the accuracy of estimates of a sensitive dichotomous characteristic.

Since Warner published his first paper on randomized response, many researchers have improved and further developed the technique, and various forms of RRT have been proposed over the last forty years. Research on RRT has paid much attention to dichotomous and quantitative sensitive questions and comparatively little to polychotomous ones. Simple random sampling is the most widely used design for surveys on sensitive topics, and the statistical methods studied are mostly restricted to it. Respondents invited to take part in research on sensitive topics are usually confined to a small area and drawn by simple random sampling. What is more, data from complex sampling surveys are often analyzed as if they came from simple random sampling. Sampling design is a particularly important aspect of a sampling survey, and sample size determination is its key step.
However, sample size determination for complex sampling surveys on sensitive topics using randomized response models has rarely been reported.

This research considered eighteen survey methods, formed by combining nine randomized response models with two sampling methods. Given the estimators of the population parameters for each randomized response model under (stratified) three-stage sampling, sample size formulae for (stratified) three-stage sampling surveys were derived so as to minimize the cost of the survey for a specified level of precision and to obtain the most precise estimates under a fixed budget. These formulae are suitable for large-scale complex sample surveys. A preliminary investigation of sensitive behaviors among female sex workers (FSW) in Xichang provided initial data on the epidemic status of sexually transmitted diseases (STD) and acquired immunodeficiency syndrome (AIDS) and supplied the values of the statistics needed in the sample size formulae. Using the derived formulae, the required sample size at each stage was calculated for a field investigation of sensitive characteristics of FSW in Xichang to be carried out in 2015. For six survey methods, formed by combining three randomized response models with two sampling methods, we built sampling simulations in SAS based on the Monte Carlo method. With the sample sizes calculated from preliminary simulation experiments, simulated surveys were conducted to obtain point and interval estimates of the simulated population parameters. We compared these estimates with the predetermined simulated population proportion/mean to evaluate the validity and reliability of the survey methods, statistical formulae and sample size formulae.

Method:
1. 
Statistical formulae were given for eighteen survey methods, formed by combining nine randomized response models (the Warner RRT model, Simmons RRT model, Greenberg RRT model, improved RRT model, single-response RRT model for multiple-choice sensitive questions, indirect-response RRT model for multiple-choice sensitive questions, unrelated question RRT model, additive constant model and multiplicative RRT model) with two sampling methods (three-stage sampling and stratified three-stage sampling). These formulae, which give estimators of the population proportion/mean and of their variances, were derived from Cochran's sampling theory together with probability and statistics theory (e.g. the total probability theorem).
2. Using the Cauchy-Schwarz inequality and the Lagrange function, formulae for the optimum sample size in three-stage and stratified three-stage sampling surveys were derived for two cases: minimizing cost for a specified sampling error, and minimizing sampling error under a fixed cost.
3. Randomizing devices for the RRT models were designed and questionnaire items on sensitive topics were drafted. Behavioral characteristics of FSW in Xichang were investigated in a three-stage sampling study from May to July 2011. Following the statistical formulae, we conducted a preliminary analysis to estimate the values of the statistics needed in the sample size formulae.
4. For all ten sensitive questions in the three-stage sampling survey, which used three RRT models (the Simmons RRT model, the single-response RRT model for multiple-choice sensitive questions and the additive constant model), optimum sample sizes were determined both when the sampling error was fixed and cost minimized and when the cost was fixed and sampling error minimized. These calculations were based on the statistics from the preliminary survey and the sample size formulae derived in this study.
5. 
Based on the survey results on the behavioral characteristics of FSW, a simulated population was built with a SAS program. The sample mean, sample proportion, or category-wise sample proportions from the survey data collected in Xichang were taken as the simulated population parameters. We simulated a stratified three-stage sampling process and then applied three RRT models (the Simmons RRT model, the single-response RRT model for multiple-choice sensitive questions and the additive constant model) to survey the virtual FSW. This process was called the simulated preliminary sampling survey. Following the sample size formulae, the SAS program calculated the sample sizes within each stratum at each stage needed for the simulated formal sampling survey. We then simulated stratified three-stage sampling once more with the calculated sample sizes; this was the simulated formal sampling survey. Using the statistical formulae, we calculated the simulated sample statistics and computed the 95% confidence interval (CI) of the simulated population proportion/mean, or of the simulated population proportion in each sensitive category. This process was repeated 100 times. If almost all 100 CIs included the predetermined population proportion/mean, the survey methods, statistical formulae and sample size formulae were judged to have strong validity. If the 100 simulated sample statistics were all close to a fixed value (the predetermined population proportion/mean, taken as the true value), the survey methods, statistical formulae and sample size formulae were judged to have a high degree of reliability.

Results:
1. This study proposed eighteen survey methods, each a combination of one of nine RRT models with one of two sampling methods. For each survey method, formulae for the estimators of the population proportion/mean and of their variances were given.
2. Formulae for the optimum sample sizes under the eighteen survey methods were derived for the cases where cost was minimized for a specified sampling error and where sampling error was minimized under a fixed budget.
3. 
Using three types of RRT models in a three-stage sampling survey of the behavioral features of FSW in Xichang, the preliminary survey results were as follows: the mean age at which FSW first provided paid sex services was 21.45 years, with a standard error of 0.8162; FSW provided sex services on average 41.66 times a month, with a standard error of 1.4550; the average price per sex service was 213.67 RMB, with a standard error of 8.2475; the proportion having a spouse or steady sex partner was 55.94%, with a standard error of 3.87%; the proportion who stopped providing sex services after being diagnosed with an STD was 75.85%, with a standard error of 3.00%; the proportion supporting the legalization of prostitution in China was 56.77%, with a standard error of 4.12%; for STD testing in the most recent year, the proportions not tested in government hospitals, testing negative and testing positive were 62.12%, 21.36% and 5.57%, with standard errors of 4.00%, 3.87% and 2.24% respectively; for HIV testing within one year, the proportions not tested, testing negative and testing positive were 57.11%, 23.54% and 2.35%, with standard errors of 4.00%, 4.00% and 1.00% respectively; the proportion of FSW reporting that a condom broke during the act was 8.27%, with a standard error of 2.65%; for condom use in the last month, the proportions who never, sometimes and always used a condom were 11.40%, 14.21% and 74.40%, with standard errors of 2.65%, 3.16% and 4.69% respectively.
4. A National Natural Science Foundation project (No. 81273188) to be launched in 2015 will conduct a three-stage sampling survey of sensitive characteristics of FSW in Xichang. Taking the preliminary survey data into account, the required sample size at each stage was determined as follows. In the first stage, five districts should be selected (n1 = 5). Then in the second stage, an average of six venues should be drawn from each chosen district (n2 = 6). 
Finally, in the third stage, twenty-nine FSW, on average, should be sampled from each chosen venue (n3 = 29).
5. The simulations of the six methods for surveying sensitive topics gave the following results:
5.1 For the Simmons RRT model applied to a dichotomous sensitive question in a stratified three-stage sampling survey, 96 of the 100 CIs for the simulated population proportion contained the predetermined simulated population proportion (taken from the preliminary survey and accepted as the true value). Therefore, the sampling survey method, statistical formulae and sample size formulae for the Simmons RRT model showed strong validity and reliability.
5.2 For the single-response RRT model for multiple-choice sensitive questions in a stratified three-stage sampling survey, 97, 97 and 96 of the 100 CIs for the simulated population proportion contained the true population proportion of the respective sensitive category. Thus, the sampling survey method, statistical formulae and sample size formulae for this model demonstrated a good degree of validity and reliability.
5.3 For the quantitative additive RRT model under a stratified three-stage sampling design, 99 of the 100 CIs for the simulated population mean contained the true value of the population mean of the sensitive quantitative characteristic. Consequently, the sampling survey method, statistical formulae and sample size formulae for the additive RRT model showed high validity and reliability.

Conclusion:
1. Statistical formulae for the estimators of the population proportion/mean and the corresponding variances were given for eighteen survey methods. Three RRT models under a three-stage sampling design were successfully applied in a preliminary investigation of FSW in Xichang, a population at high risk of STD and AIDS. The survey methods and statistical formulae proved effective and reliable and have broad prospects for application.
2. We presented the preliminary results. 
FSW provided sexual services to their clients many times a month and charged low fees for each service. A large proportion of FSW agreed that prostitution should be legalized. Many FSW reported that they had not received HIV testing and that a condom had broken during sexual service. Government and health authorities should pay more attention to this worrying situation and look for suitable countermeasures.
3. Formulae for the optimum sample size were derived, providing a scientific method for designing sampling surveys on sensitive topics. The optimum sample size at each stage was calculated for the formal three-stage sampling survey of sensitive characteristics among FSW using three RRT models; the approach has broad prospects for wider use.
4. For six survey methods, combining three RRT models with two sampling methods, sample sizes were determined from preliminary simulation data, and formal simulated sampling surveys were then conducted. The simulation results showed that almost all 95% CIs contained the true value of the population proportion/mean, indicating that the survey methods, statistical formulae and sample size formulae were accurate and reliable.
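The optimum-allocation step via the Cauchy-Schwarz inequality can be illustrated with the textbook three-stage case. The sketch below is not the thesis's own formulae, which are not reproduced in the abstract. It assumes finite population corrections are ignored, a variance of the form V = S1²/n1 + S2²/(n1·n2) + S3²/(n1·n2·n3), and a linear cost C = c1·n1 + c2·n1·n2 + c3·n1·n2·n3. The symbols `S1`, `S2`, `S3` (stage-wise standard deviations, which under an RRT model would also carry the extra variance from the randomizing device) and `c1`, `c2`, `c3` (unit costs per stage) are hypothetical names chosen for illustration. Under these assumptions V·C ≥ (S1√c1 + S2√c2 + S3√c3)², with equality at the allocation computed here.

```python
import math


def variance(n1, n2, n3, S1, S2, S3):
    """Assumed variance of the estimator, ignoring finite population corrections."""
    return S1**2 / n1 + S2**2 / (n1 * n2) + S3**2 / (n1 * n2 * n3)


def cost(n1, n2, n3, c1, c2, c3):
    """Assumed linear survey cost over the three stages."""
    return c1 * n1 + c2 * n1 * n2 + c3 * n1 * n2 * n3


def optimal_three_stage(S1, S2, S3, c1, c2, c3, V0=None, C0=None):
    """Allocation attaining equality in the Cauchy-Schwarz bound on V*C.

    Give V0 to minimize cost at fixed variance, or C0 to minimize
    variance at fixed cost. Returns real-valued (n1, n2, n3); in
    practice these would be rounded to integers.
    """
    if (V0 is None) == (C0 is None):
        raise ValueError("give exactly one of V0 or C0")
    n2 = (S2 / S1) * math.sqrt(c1 / c2)   # second-stage units per first-stage unit
    n3 = (S3 / S2) * math.sqrt(c2 / c3)   # third-stage units per second-stage unit
    unit_var = S1**2 + S2**2 / n2 + S3**2 / (n2 * n3)   # equals n1 * V
    unit_cost = c1 + c2 * n2 + c3 * n2 * n3             # equals C / n1
    n1 = unit_var / V0 if V0 is not None else C0 / unit_cost
    return n1, n2, n3
```

Note that `n2` and `n3` do not depend on which constraint is imposed; only `n1` changes, which is why both cases in the abstract (fixed error, fixed cost) can share one allocation rule for the later stages.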

