山西省HIV/AIDS结核感染监测资料预测方法研究

Prediction Methods Study of Tuberculosis Infection Monitoring Data of HIV/AIDS in Shanxi Province

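【Author】 赵晋芳 (Zhao Jinfang)

【Supervisor】 刘桂芬 (Liu Guifen)

【Author information】 Shanxi Medical University, Epidemiology and Health Statistics, 2009, doctoral thesis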

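【Summary】 Objective: To make effective use of the information held in the tuberculosis management information system of the Chinese Center for Disease Control and Prevention (CDC), the Shanxi Province tuberculosis network surveillance database, and the surveillance and follow-up data on TB/HIV co-infected patients already collected in five project counties of the Yuncheng area (Round 5 China Global Fund TB/HIV co-infection project), and on that basis to design, revise, and refine an epidemiological surveillance database covering tuberculosis infection, tuberculosis disease, and TB/HIV co-infection in Shanxi Province; to analyze the data on tuberculosis infection among people with HIV/AIDS in the five project counties using rare-event logistic regression and Bayesian estimation, so as to clarify the current status of tuberculosis infection among people with HIV/AIDS in the Yuncheng area and its influencing factors, to compare these methods with conventional analyses, and to inform targeted two-way prevention and control measures; and to forecast the occurrence, spread, and trend of smear-positive tuberculosis, the most infectious form of the disease, so that government and the relevant departments can take timely and effective action.

Methods: Taking into account the characteristics of tuberculosis in Shanxi Province and building on the surveillance database of the CDC tuberculosis information management system, the study refined, revised, and compiled an epidemiological surveillance and follow-up record form for TB/HIV co-infection in Shanxi Province. TB/HIV co-infected patients were monitored in the five project counties of Yuncheng city under the Round 5 China Global Fund tuberculosis project (Ruicheng, Xiaxian, Xinjiang, Jiangxian, and Jishan), providing baseline data for evaluating the severity of TB/HIV co-infection in Shanxi Province and its influencing factors. In view of the characteristics of the co-infection surveillance data, rare-event logistic regression, random-effects logistic regression, and Bayesian estimation were used to estimate the probability of tuberculosis infection among people with HIV/AIDS, and the methods for evaluating the surveillance data were compared; all analyses were programmed in SAS 9.1.3 and Stata 10.0. Using data from the CDC tuberculosis information management system, ARMA and ARIMA time series models and a Microsoft SQL Server Analysis Services data mining model (the Microsoft Time Series algorithm) were built for smear-positive tuberculosis cases in Shanxi Province from 2005 to 2008, the trend of tuberculosis incidence in the province was forecast, and the forecasts of the two approaches were compared.

Results: 1. The national tuberculosis information management system currently runs well and collects information comprehensively. Tuberculosis surveillance information is the basic data for tuberculosis prevention and control and for evaluating interventions, and its quality directly affects the evaluation. Although the national system runs well, the working conditions in Shanxi and the needs of the provincial tuberculosis control office call for tuberculosis infection, tuberculosis disease, and HIV infection information to be merged into a single linked information resource. On the basis of the existing national system, this study extended the content on TB/HIV co-infection control. Drawing on many years of epidemic surveillance data from Shanxi Province, it added follow-up and death registration items for TB/HIV co-infection and further refined and revised the baseline survey items, adding tuberculosis patients' income level, nutritional status, treatment records, and adverse reactions. The surveillance system can readily supply an integrated statistical analysis database for TB/HIV co-infection, offering a model for managing and analyzing tuberculosis and AIDS surveillance data. 2. When the influencing factors of tuberculosis infection among people with HIV/AIDS are studied with classical logistic regression, the two values of the response variable often occur with very different frequencies (tuberculosis infection among people with HIV/AIDS is a rare medical event), which leads to unrealistic parameter estimates and underestimates the probability of the rare event. This study examined rare-event logistic regression methods for correcting the parameter and probability estimates. Through an exposition of the principles and software programming, ordinary logistic regression, logistic regression with prior correction, logistic regression with MCN prior correction, logistic regression with weighted correction, and logistic regression with MCN weighted correction were fitted to the surveillance data, and the non-nested models were compared using a programmed implementation of the Vuong test. Logistic regression with MCN weighted correction fitted the co-infection surveillance data best. For estimating the probability of the rare event, maximum likelihood estimation, weighted maximum likelihood estimation, approximately unbiased estimation, and approximate Bayesian estimation were applied, and the approximate Bayesian estimate performed best. According to the approximate Bayesian estimate, the probability of tuberculosis infection among people with HIV/AIDS in Shanxi Province is about 5.0%. 3. Because the tuberculosis infection rate among people with HIV/AIDS shows a clustering effect across the five surveillance counties, a generalized linear mixed-effects model was used, and a random-effects logistic regression model was built to handle the non-independence of tuberculosis infection within the same area. The data from the five counties were analyzed with random-effects logistic regression, rare-event random-effects logistic regression, and rare-event random-effects logistic regression with MCN weighted correction. The model fit indices showed that the rare-event random-effects logistic regression with MCN weighted correction fitted the data best. The CD4 cell count can serve as an early-warning factor when estimating the probability of tuberculosis in people with HIV/AIDS: for each one-unit increase in the logarithm of the CD4 count, the risk of tuberculosis infection among people with HIV/AIDS falls by 74.9%. 4. Parameter estimation in generalized linear mixed-effects models usually requires numerical integration of the joint likelihood or some approximation of the model; restricted pseudo-likelihood estimation is a first-order Taylor approximation of the generalized linear mixed-effects model. This study introduced Bayesian estimation into the parameter estimation of generalized linear mixed-effects models. With noninformative priors, the posterior means obtained by Bayesian estimation were close to the restricted pseudo-likelihood estimates. 5. The time series forecasting model ARIMA(1,1,0)(1,1,0)12 for smear-positive tuberculosis cases in Shanxi Province from 2005 to 2008 indicated that the number of new smear-positive cases in 2009 would be substantially lower than in previous years, that case numbers were likely to be relatively high from March to August 2009, and that April was expected to have the most cases, suggesting that tuberculosis control should be strengthened and timely prevention and treatment emphasized. For this model the mean absolute error between predicted and observed values was 136.64 and the mean relative error was 8.10%, indicating a good fit. The Microsoft Time Series algorithm predicted the same trend as the ARIMA(1,1,0)(1,1,0)12 model; epidemic prevention departments at all levels should act on this early warning and further strengthen tuberculosis control in the second and third quarters of 2009. 6. For January 2007 to August 2008, the Microsoft Time Series algorithm had a mean absolute prediction error of 116.7 and a mean relative error of 6.60%, whereas the ARIMA model had a mean absolute prediction error of 104.4 and a mean relative error of 5.90%. The predictions of the Microsoft Time Series algorithm were broadly consistent with those of the ARIMA(1,1,0)(1,1,0)12 model and both had small relative errors, but the prediction error of the Microsoft Time Series algorithm for September to December 2008 was clearly larger than that of the ARIMA(1,1,0)(1,1,0)12 model. The ARIMA model is therefore the better choice for forecasting smear-positive tuberculosis cases in Shanxi Province: differencing extracts the strong deterministic information in the series, such as the seasonal effect and the long-term trend, and the model can also exploit the random information, which gives it higher prediction accuracy.

Conclusions: 1. The revised and refined tuberculosis infection surveillance form, with added content on TB/HIV co-infection control, provides richer first-hand material for evaluating the prevention and treatment of tuberculosis infection and disease. It is an important precondition for effectively controlling TB/HIV co-infection, reducing tuberculosis incidence and mortality among people with HIV, reducing the chance of HIV infection among tuberculosis patients, sharing tuberculosis and AIDS control information, and evaluating the data jointly. 2. For rare-event data, rare-event logistic regression is indeed superior to ordinary logistic regression for both parameter estimation and prediction. It is therefore a statistical model worth promoting for the many biomedical studies of diseases with low incidence or prevalence. Whether the parameters and predicted probabilities need to be corrected in a given application is a question of model selection between rare-event and ordinary logistic regression. Because the two are competing non-nested nonlinear models rather than nested ones, this thesis first proposes the Vuong test as the method for judging between them; its principle is easy to understand, it can be programmed in SAS, its results are reasonable, and it solves the practical problem, so it is a method worth recommending. 3. Under the Bayesian framework, with noninformative priors on the parameters and MCMC used to estimate the generalized linear mixed-effects model, the results agreed with the restricted pseudo-likelihood estimates, providing another effective way to analyze generalized linear mixed-effects models. Compared with restricted pseudo-likelihood estimation, Bayesian estimation of random-effects logistic regression gives more precise results and a more reasonable interpretation, and it is particularly advantageous when supported by statistical software that can carry out Bayesian analysis. 4. Time series (ARMA) models handle stationary series conveniently, and in practice many non-stationary series behave like stationary series after differencing. The ARIMA model combines differencing with the ARMA model, fits series data well, and is a practical mathematical model and forecasting tool for infectious diseases, tuberculosis in particular. Choosing an appropriate model according to its conditions of application is the key to ensuring the forecasting performance of ARIMA. 5. This study is the first to introduce the Microsoft Time Series algorithm into the statistical analysis of medical time series data, building a training model on the CDC tuberculosis management information for smear-positive tuberculosis cases in Shanxi Province from January 2005 to December 2008. As a new forecasting algorithm it combines autoregression with decision tree techniques and enriches the methods available for forecasting medical time series. Although in this case its prediction error was slightly larger than that of the ARIMA model and the robustness of its forecasts needs further study, its principle is simple and easy to understand and the software is easy to operate, which makes it convenient for rapid analysis of surveillance data at the grassroots level; it is a new method worth promoting and learning.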
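
The prior correction and weighted correction named in Results 2 can be illustrated with a minimal Python sketch. It assumes the widely used King and Zeng formulation of rare-event logistic regression; the summary does not state which formulas the thesis used, and the "MCN" variants are not covered here. All function names and the sample event fraction of 0.5 are hypothetical; the population rate of 0.05 simply echoes the estimate of about 5% reported in the summary.

```python
import math


def prior_corrected_intercept(b0_hat, sample_rate, population_rate):
    """Shift a fitted logistic intercept so that it reflects the population
    event rate instead of the event fraction in the analysed sample.
    (Assumed King-Zeng prior correction, not confirmed by the thesis.)"""
    if not (0 < sample_rate < 1 and 0 < population_rate < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    correction = math.log(
        ((1 - population_rate) / population_rate)
        * (sample_rate / (1 - sample_rate))
    )
    return b0_hat - correction


def case_control_weights(sample_rate, population_rate):
    """Observation weights for the weighted correction: (events, non-events)."""
    w_event = population_rate / sample_rate
    w_nonevent = (1 - population_rate) / (1 - sample_rate)
    return w_event, w_nonevent


def predicted_probability(intercept, slope, x):
    """Logistic probability for a single covariate value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


if __name__ == "__main__":
    # Hypothetical inputs: a sample with 50% events, a population rate of 5%.
    b0 = prior_corrected_intercept(0.0, sample_rate=0.5, population_rate=0.05)
    print("corrected intercept:", b0)
    print("weights (event, non-event):", case_control_weights(0.5, 0.05))
    print("baseline probability:", predicted_probability(b0, 0.0, 0.0))
```

Only the intercept changes under the prior correction; the slope estimates are left as fitted, which is why the correction is cheap to apply after an ordinary logistic fit.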

【Abstract】 Objectives: In cooperation with the Shanxi Provincial Center for Disease Control and Prevention, tuberculosis data were obtained from the network surveillance system of the Chinese Center for Disease Control and Prevention (CDC), and TB/HIV co-infection data were obtained from a survey of five counties in Yuncheng city under the fifth-round Global Fund tuberculosis project. The study aimed to establish a monitoring and evaluation system for TB/HIV co-infection, to provide a scientific basis for trend prediction and control measures for TB/HIV co-infection, and to determine the severity of tuberculosis infection among HIV/AIDS patients and its influencing factors using statistical methods.

Methods: TB/HIV co-infected patients were surveyed and monitored in five counties of Yuncheng city: Ruicheng, Xiaxian, Xinjiang, Jiangxian, and Jishan. A follow-up form suited to the actual co-infection situation in Shanxi was prepared and revised. The collected data were analyzed with rare-event logistic regression, random-effects logistic regression, and Bayesian estimation, programmed in SAS 9.1.3 and Stata 10.0. Data on smear-positive tuberculosis patients were collected from the CDC tuberculosis information management system database for Shanxi Province and compared. Time series analysis (ARMA and ARIMA models) and a Microsoft SQL Server Analysis Services data mining model (the Microsoft Time Series algorithm) were used to predict the trend of tuberculosis incidence in Shanxi Province.

Results: 1. Because the CDC tuberculosis information management system database did not include information on TB/HIV co-infection, the database for Shanxi Province was extended with a follow-up survey form, a death registration form, and items on income, nutritional status, and related factors, providing baseline information for the prevention and treatment of TB/HIV co-infection. 2. When the two values of the dependent variable occur with very different frequencies, classical logistic regression may underestimate the probability of rare events. The parameters and the estimated probabilities were therefore corrected. Five models were fitted: classical logistic regression, logistic regression with prior correction, logistic regression with MCN prior correction, logistic regression with weighted correction, and logistic regression with MCN weighted correction. The Vuong test was used to compare the models, and logistic regression with MCN weighted correction gave the best fit. Maximum likelihood estimation, weighted maximum likelihood estimation, approximately unbiased estimation, and approximate Bayesian estimation were used to estimate the probability, and the approximate Bayesian estimate performed best. According to the approximate Bayesian estimate, the tuberculosis infection rate among HIV/AIDS patients in Shanxi Province was about 0.05. 3. Because of group effects across the five counties, individuals within the same region were more similar to one another, so tuberculosis infection among HIV/AIDS patients in the same region was not independent. A generalized linear mixed-effects model was used to build a random-effects logistic regression that handles this non-independence. The data were fitted with random-effects logistic regression, rare-event random-effects logistic regression, and rare-event random-effects logistic regression with MCN weighted correction. The goodness-of-fit indicators showed that the rare-event random-effects logistic regression with MCN weighted correction fitted the data best. The CD4 level affected the probability of tuberculosis in HIV/AIDS patients: for each one-unit increase in the logarithm of the CD4 count, the risk of tuberculosis infection in HIV/AIDS patients fell by 74.9 percent. 4. Generalized linear mixed-effects models require numerical integration of the joint likelihood function or an approximation of the model, and restricted pseudo-likelihood is a first-order Taylor approximation of the generalized linear mixed-effects model. Bayesian estimation was introduced into the generalized linear mixed-effects model, and with a noninformative prior the posterior estimates were close to the restricted pseudo-likelihood estimates. 5. The ARIMA(1,1,0)(1,1,0)12 model fitted the data well. The mean absolute error between predicted and actual values was 136.64, with a mean relative error of 8.10%. The forecasts for 2009 indicated that smear-positive cases would be much lower than in previous years, with relatively high numbers from March to August and a peak in April. The results of the Microsoft Time Series algorithm were consistent with those of the ARIMA(1,1,0)(1,1,0)12 model. 6. For historical predictions from January 2007 to August 2008, the mean absolute error of the Microsoft Time Series algorithm was 116.7 and its mean relative error was 6.60 percent, while the mean absolute error of the ARIMA model was 104.4 and its mean relative error was 5.90 percent; the relative prediction error of the ARIMA model was lower than that of the Microsoft Time Series algorithm.

Conclusions: 1. Content on TB/HIV co-infection was added to the revised tuberculosis surveillance form. The revised form allows better evaluation of tuberculosis treatment outcomes and more effective control of TB/HIV co-infection, and provides a basis for establishing a cooperative model for tuberculosis and AIDS control. 2. Rare-event logistic regression was superior to classical logistic regression for analyzing rare events and is worth promoting and applying to rare diseases. The Vuong test was used to evaluate the different regression models. 3. Under the Bayesian framework, noninformative priors were specified for the parameters of the generalized linear mixed-effects model and MCMC was applied for parameter estimation; the estimates were consistent with those obtained by restricted pseudo-likelihood. Bayesian models offer an effective alternative for generalized linear mixed-effects models. Because Bayesian estimation does not rely on asymptotic theory or approximation, it is more accurate and more natural than restricted pseudo-likelihood estimation under classical statistics. With software support for Bayesian analysis, Bayesian models become even more attractive. 4. Time series (ARMA) models handle stationary series conveniently and fit well; they are a practical mathematical model and forecasting tool for infectious diseases, especially tuberculosis. Choosing an appropriate analytical model is the key to forecasting performance. 5. The Microsoft Time Series algorithm was introduced into the analysis of medical time series data for the first time, and a training model was constructed for smear-positive cases in Shanxi Province from January 2005 to December 2008. The algorithm combines autoregression with decision tree techniques, enriching the methods for predicting medical time series data. Although the prediction error of the Microsoft Time Series algorithm was slightly larger than that of the ARIMA model in this case and the robustness of its predictions needs further study, its principle is simple and it is easy to understand and operate, so it is worth promoting as a new prediction algorithm.
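
The two error measures used above to compare the forecasting methods can be sketched in Python as follows. This is a minimal illustration that assumes the mean relative error is the average of |actual - predicted| / actual over the months; the abstract does not give its exact definition. The function names and the monthly counts are hypothetical and are not data from the thesis.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all periods."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_relative_error(actual, predicted):
    """Average of |actual - predicted| / actual over all periods
    (assumed definition; observed counts must be non-zero)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Hypothetical monthly case counts, for illustration only.
    observed = [1000, 1200, 800]
    forecast = [900, 1260, 840]
    print("MAE:", mean_absolute_error(observed, forecast))
    print("MRE:", mean_relative_error(observed, forecast))
```

Computing both measures on the same held-out months for each model gives a like-for-like comparison of the kind reported for January 2007 to August 2008.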

  • 【Classification No.】 R512.91; R52
  • 【Times cited】 5
  • 【Downloads】 464