Data Mining of Hospital Information and Exploration of Its Practical Implementation

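【Author】 Yi Jing (易静)

【Supervisor】 Wang Runhua (王润华)

【Author information】 Chongqing University of Medical Sciences (重庆医科大学), Clinical Laboratory Diagnostics, 2007, PhD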

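【Summary】 This study explores how to implement online hospital data mining based on SPSS Clementine, with the aim of saving and sharing resources. On that basis it examines the use of data mining for factor forecasting, disease discrimination and diagnosis, and disease association analysis, through three case studies: the epidemic course and trend of tuberculosis in Chongqing; the risk factors for, and discriminant classification models of, high-level axillary lymph node metastasis in breast cancer; and the discovery of association knowledge between diabetes and its complications. The goal is to give clinical managers, medical staff, and researchers a tool for decision support and comprehensive analysis in scientific management, better diagnosis and treatment, and medical research.

The "knowledge discovery" problem that is widespread in the information field urgently needs to be studied and solved. Methodologically, choosing an appropriate data mining algorithm on scientific grounds is the key to obtaining accurate knowledge rules, and implementing online hospital data mining has substantial practical value for improving hospital management and medical quality. With the rapid development of computer technology and biomedical engineering, information technology is widely used in medicine, so large amounts of medical information are recorded precisely and large data resources have accumulated. Behind this surge of data lies much important and useful information, and mining deep, implicit, valuable knowledge from it is increasingly important. To date, research on data mining in medical services has been reported in China, but no research on or application of an online analysis system has been found, and methodological research on scientifically choosing an appropriate data mining algorithm for practical applications with different goals has not previously been carried out.

Methods: The study uses the Java network programming language to implement online hospital data mining based on SPSS Clementine. The hospital data come from three medical institutions in Chongqing (the Anti-tuberculosis Institute of Chongqing and the First and Second Affiliated Hospitals of Chongqing University of Medical Sciences) and cover tuberculosis, breast cancer, and diabetes. The ARIMA model, the BP neural network model, and the GM(1,1) model are used to forecast tuberculosis incidence and are compared. The Logistic model, the CHAID model, the RBFN model, the RBFN-Logistic hybrid model, and the RBFN-CHAID hybrid model are used to classify high-level axillary lymph node metastasis in breast cancer and are compared. The Apriori association analysis model is used to describe the strength of association between diabetes and its complications.

Main research content: ① Exploring the implementation of online data mining in the Java network programming language. ② Analyzing the epidemic course of tuberculosis in Chongqing, the risk factors for high-level axillary lymph node metastasis in breast cancer, and the association between diabetes and its complications. ③ Forecasting tuberculosis incidence with the ARIMA, BP neural network, and GM(1,1) models. ④ Classifying high-level axillary lymph node metastasis in breast cancer with the Logistic, CHAID, RBFN, RBFN-Logistic hybrid, and RBFN-CHAID hybrid models. ⑤ Evaluating the models with accuracy and reliability indices.

Results: ① SPSS Clementine was preliminarily integrated, and the workflow of online hospital data collection, engine execution, result processing, and result query was implemented. ② Tuberculosis shows clear seasonal epidemic peaks: in general, fewer cases occur in the first and third quarters of each year and more in the second and fourth. The quarterly incidence in one tuberculosis epidemic year is related to the quarterly incidence of the one and a half epidemic years a year earlier. An accurate forecast of tuberculosis incidence must take seasonal, cyclical, and random factors into account. ③ Comparing the ARIMA, BPANN2, and GM(1,1) models, the relative errors of the first two in forecasting tuberculosis incidence were 0.05872 and 0.06999, against 0.01210 for the GM(1,1) model, which indicates that the residual GM(1,1) model forecasts tuberculosis comparatively well. ④ High-level axillary lymph node metastasis in breast cancer is clearly related to the status of the middle and low axillary lymph nodes and to tumor size. ⑤ The RBFN model expresses diagnostic knowledge as a weight matrix, and the Logistic model and the RBFN-Logistic hybrid model express it as logistic regression coefficients; neither form is easy for users to interpret. The CHAID model and the RBFN-CHAID hybrid model express it in natural language as a tree, which makes the results easier to understand. ⑥ The average predictive accuracies of the Logistic, CHAID, RBFN, RBFN-Logistic hybrid, and RBFN-CHAID hybrid models were 83.34%, 83.79%, 85.61%, 83.77%, and 79.74%, and their |r − 1| values were 0.0720, 0.0625, 0.0549, 0.0766, and 0.0948, respectively. The reliability of the knowledge obtained by the RBFN model and its accuracy on the test set were clearly better than those of the other algorithms. ⑦ The diagnostic rules extracted by the CHAID model are simple to read and convenient to apply, and they show how much each diagnostic index contributes to the diagnosis of high-level axillary lymph node metastasis in breast cancer. The CHAID decision tree shows that the status of the middle and low lymph nodes is decisive for this diagnosis, with tumor size as an important further index. The CHAID model is therefore a simple and feasible method of computer-aided diagnosis that extracts diagnostic rules automatically from cases, has broad practical value, and can be applied to diagnostic research on other diseases. ⑧ Urinary tract infection, nephropathy, eye disease, neuropathy, hyperlipidemia, hypertension, heart disease, and coronary heart disease show a clear tendency to occur together with diabetes.

Conclusions: ① Online hospital data mining is an important component of future hospital information systems and has substantial practical value for improving hospital management and medical quality and for reducing hospital operating costs. ② The GM(1,1) model was identified as the best of the compared algorithms for forecasting tuberculosis incidence. The best algorithm for classifying high-level axillary lymph node metastasis in breast cancer is the RBFN model, and the CHAID model, which ranks just behind it in classification accuracy and reliability, is also an excellent choice; this judgment combines ease of understanding for users with classification accuracy and reliability. As an assistant tool for doctors, the Apriori association analysis model prompts clinicians to pay attention to, and study, the real relationship between urinary tract infection and diabetes.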
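The GM(1,1) forecasting step and the relative-error comparison described above can be illustrated with a minimal Python sketch. It is written under stated assumptions and is not the thesis's implementation, which ran through SPSS Clementine and Java: the function names `gm11_fit`, `gm11_series`, and `mean_relative_error` are hypothetical, the incidence values are made up, and the sketch is a plain GM(1,1) without the residual correction named in the results.

```python
import math


def gm11_fit(x0):
    """Estimate GM(1,1) parameters (a, b) for a positive series x0."""
    # 1-AGO: accumulate the raw series to smooth out randomness.
    x1 = []
    running = 0.0
    for value in x0:
        running += value
        x1.append(running)
    # Background values: means of consecutive accumulated points.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(x0))]
    y = x0[1:]
    # Ordinary least squares for the grey equation x0[k] = -a * z[k] + b.
    m = len(z)
    sum_z, sum_y = sum(z), sum(y)
    sum_zz = sum(v * v for v in z)
    sum_zy = sum(zv * yv for zv, yv in zip(z, y))
    slope = (m * sum_zy - sum_z * sum_y) / (m * sum_zz - sum_z ** 2)
    b = (sum_y - slope * sum_z) / m
    return -slope, b


def gm11_series(x0, a, b, steps=0):
    """Return fitted values for x0 plus `steps` forecasts (assumes a != 0)."""
    def x1_hat(k):
        # Time response of the whitened equation dx1/dt + a * x1 = b.
        return (x0[0] - b / a) * math.exp(-a * k) + b / a
    # Difference the accumulated estimates to get back to the raw scale.
    return [x0[0]] + [x1_hat(k) - x1_hat(k - 1)
                      for k in range(1, len(x0) + steps)]


def mean_relative_error(actual, predicted):
    """Average of |actual - predicted| / actual over paired values."""
    return sum(abs(p - t) / t for t, p in zip(actual, predicted)) / len(actual)


# Made-up incidence series that grows about 10% per period (illustration only).
incidence = [100.0 * 1.1 ** k for k in range(6)]
a, b = gm11_fit(incidence)
fitted_and_forecast = gm11_series(incidence, a, b, steps=2)
error = mean_relative_error(incidence, fitted_and_forecast[:len(incidence)])
print(f"a={a:.4f}, b={b:.2f}, mean relative error={error:.5f}")
```

A negative `a` corresponds to a growing series. Because the model is a smooth exponential, it does not capture the seasonal peaks reported for tuberculosis on its own, which is one reason a comparison against ARIMA and a neural network is informative.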

【Abstract】 Objective: To establish practical, easy-to-operate online software for mining hospital information, based on SPSS Clementine and integrated with the hospital information system; to discuss the application of data mining to variable forecasting, disease diagnosis, and disease association rules; and to study data mining methodology by analyzing the prevalence and future trend of tuberculosis, the risk factors and classification models for axillary level III lymph node metastasis in breast cancer, and the association rules between diabetes and diabetic complications, using the optimal data mining algorithm in each case. Online data mining for the hospital information system can save money and share resources, and it can also give clinical managers, doctors, nurses, and other technical staff an efficient tool for comprehensive analysis and decision-making in scientific administration, more accurate diagnosis, more effective treatment, and medical research. From a methodological standpoint, scientifically choosing the optimal data mining algorithm is the key step in obtaining accurate knowledge. With the development of computer technology and biomedical engineering research, and the wide application of computer information technology in medicine, a great many precise medical records have been stored, and they contain much important knowledge. Mining the hidden, deep-seated, valuable knowledge in these records is becoming more and more important, because the "knowledge discovery" problem in the medical information field urgently needs a solution, and solving it can improve hospital management and the quality of medical service.
To date, there have been some publications in America on the application of data mining to medical services via the internet, but none in China, and existing studies have not scientifically selected the optimal data mining algorithm for different practical objectives.

Methods and Data: Online data mining of the hospital information system was implemented in the Java network programming language on the basis of SPSS Clementine. The autoregressive integrated moving average model (ARIMA), the back-propagation artificial neural network model (BPANN), and the grey model GM(1,1) were used to forecast the prevalence of tuberculosis, and the accuracy of the three algorithms was compared, based on data from the Anti-tuberculosis Institute of Chongqing. The logistic model (Logistic), the CHAID model (CHAID), the radial basis function network model (RBFN), the combination of RBFN and Logistic, and the combination of RBFN and CHAID were used to classify the status of axillary level III lymph nodes in breast cancer, and the accuracy and reliability of the five algorithms were compared, based on data from the First Affiliated Hospital of Chongqing University of Medical Sciences.
The Apriori model was used to describe the association rules between diabetes and diabetic complications, based on data from the Second Affiliated Hospital of Chongqing University of Medical Sciences.

Study content: ① Exploring the implementation of online data mining of the hospital information system, based on SPSS Clementine, in the Java network programming language. ② Analyzing the prevalence of tuberculosis in Chongqing, the risk factors for axillary level III lymph node metastasis in breast cancer, and the association rules between diabetes and diabetic complications. ③ Using three data mining algorithms, ARIMA, BPANN, and GM(1,1), to predict the prevalence of tuberculosis and comparing their accuracy. ④ Building combination models by combining RBFN with Logistic and RBFN with CHAID. ⑤ Using Logistic, CHAID, RBFN, the combination of RBFN and Logistic, and the combination of RBFN and CHAID to classify the status of axillary level III lymph nodes in breast cancer, and comparing the accuracy and reliability of the five algorithms.

Results: ① Data mining software for the hospital information system, delivered via the internet and based on SPSS Clementine, was preliminarily set up, implementing data collection, engine execution, result storage, and result query. ② The prevalence of tuberculosis shows a clear seasonal pattern that rises and falls in waves across the year: in general, incidence falls in the first and third quarters and rises in the other two. The quarterly incidence in a given year is correlated with that of six quarters (one and a half epidemic years) a year earlier.
Forecasts are accurate only when the seasonal, cyclical, and random factors of tuberculosis are taken into account. ③ The average relative errors of the ARIMA, BPANN2, and GM(1,1) predictive models are 0.05872, 0.06999, and 0.01210, respectively, which indicates that GM(1,1) performs best of the three in predicting the prevalence of tuberculosis. ④ The status of axillary level III lymph nodes in breast cancer is significantly correlated with the status of axillary level I and II lymph nodes and with tumor size. ⑤ Some expressions of diagnostic knowledge are difficult for users to understand: RBFN expresses it as a weight matrix, and Logistic and the combination of RBFN and Logistic express it as logistic regression coefficients. CHAID and the combination of RBFN and CHAID express it as a tree described in natural language, which is easy to understand. ⑥ The average predictive accuracies of Logistic, CHAID, RBFN, the combination of RBFN and Logistic, and the combination of RBFN and CHAID are 83.34%, 83.79%, 85.61%, 83.77%, and 79.74%, respectively, and the absolute values of their reliabilities minus 1 are 0.0720, 0.0625, 0.0549, 0.0766, and 0.0948, respectively. RBFN has higher accuracy and reliability than the other four algorithms, which makes it the best of the five for classifying the status of axillary level III lymph nodes in breast cancer. ⑦ The diagnostic knowledge of CHAID, presented as a tree chart, shows the order of influence of the diagnostic indices: the status of axillary level I and II lymph nodes and tumor size are very important for classifying the status of axillary level III lymph nodes in breast cancer. CHAID is a simple, practical, computer-based diagnostic method that can automatically extract diagnostic knowledge from records.
It can therefore be widely applied to research on breast cancer and other diseases. ⑧ Eight conditions show a significant association with diabetes: urinary tract infection, diabetic nephropathy, diabetic eye disease, diabetic neuropathy, hyperlipidemia, hypertension, diabetic heart disease, and coronary heart disease.

Conclusions: ① Online data mining of hospital data based on SPSS Clementine has been preliminarily implemented; it is a very important part of the hospital information system. Combining the hospital information system with data mining will extend the use of computer information technology, improve hospital management, raise the quality of medical service, and reduce hospital operating costs. ② GM(1,1) was identified as the best of the compared algorithms for predicting the prevalence of tuberculosis. RBFN and CHAID are the two best algorithms for classifying the status of axillary level III lymph nodes in breast cancer, a conclusion based on how each of the five algorithms expresses diagnostic knowledge and on their accuracy and reliability. As an assistant tool, Apriori can prompt doctors to investigate the real relationship between diabetes and urinary tract infection, which is seldom reported in medical journals.
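The association analysis between diabetes and its complications rests on the Apriori idea of keeping only itemsets whose support clears a threshold and then reading rule confidence off those supports. The following Python sketch illustrates that idea under stated assumptions: the function names `apriori` and `rule_confidence` are hypothetical, the five patient records are invented for illustration, and the code is not the SPSS Clementine Apriori implementation used in the thesis.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    records = [frozenset(t) for t in transactions]
    n = len(records)

    def support(itemset):
        # Fraction of records that contain every item in the itemset.
        return sum(1 for r in records if itemset <= r) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for r in records for i in r}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {x | y for x in current for y in current if len(x | y) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def rule_confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent, from stored supports."""
    left = frozenset(antecedent)
    return frequent[left | frozenset(consequent)] / frequent[left]


# Invented records (illustration only): diagnoses recorded for five patients.
records = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "nephropathy"},
    {"diabetes", "nephropathy"},
    {"hypertension"},
    {"diabetes", "hypertension", "nephropathy"},
]
frequent = apriori(records, min_support=0.5)
confidence = rule_confidence(frequent, {"diabetes"}, {"nephropathy"})
print(f"{len(frequent)} frequent itemsets; "
      f"confidence(diabetes -> nephropathy) = {confidence:.2f}")
```

A high confidence here only says that the two diagnoses are often recorded together in the sample. As the conclusions note for urinary tract infection, such a rule is a prompt for clinical investigation and not evidence of a causal relationship.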

  • 【Classification code】 R197.324
  • 【Times cited】 17
  • 【Downloads】 2345