Hadoop下基于贝叶斯分类的气象数据挖掘研究

The Research of Meteorological Data Mining Using Bayesian Classifier Based on Hadoop

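【Author】 Liu Yin (刘寅)

【Advisor】 Xue Shengjun (薛胜军)

【Author information】 Nanjing University of Information Science and Technology, Meteorological Information Technology and Security, 2012, Master's thesis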

【Abstract】 As meteorological services modernize, the volume of meteorological data keeps growing, and processing and computing over this massive data efficiently has become an important problem in meteorological data mining. Distributed computing makes this feasible and has become the foundation for applying data mining in meteorology. Based on an analysis of the characteristics and processing of meteorological data, we select the daily records of four stations in Jiangsu Province (Xuzhou, Ganyu, Nanjing, Dongtai) from the China surface climate daily-value data set, covering 1951 onward, as the object of study. The main work is as follows: (1) We analyze the technologies behind the open-source cloud platform Hadoop, focusing on the MapReduce programming model, job flow, and key techniques, and implement a rainfall grading and statistics experiment based on MapReduce. The results show that very few rainfall observations are missing from the data set, so it is suitable for the study. (2) We study how to apply the Naive Bayes (NB) classifier to rainfall classification. Given the characteristics of the meteorological data set, we use the correlation coefficient to select predictors and the PKI discretization method to discretize them. After obtaining the classification accuracy by training and testing on the data set, we analyze the shortcomings of the NB classifier in rainfall classification from three aspects: the temporal continuity of the predictors, underflow in the probability calculations, and the discretization method. (3) To address these shortcomings and the low efficiency of NB on large meteorological data sets, we recast three stages of NB (preprocessing, model training, and accuracy assessment) in MapReduce form and propose an improved Naive Bayes classifier based on the MapReduce model (MRNB). Rainfall classification experiments show that, compared with the NB classifier, MRNB makes full use of cluster resources, improves mining efficiency on large data volumes, and achieves better accuracy on large meteorological data sets. MRNB also scales well, offering a better basis for future classification mining of massive meteorological data.

【Key words】 Data Mining; Naïve Bayes; Hadoop; MapReduce; Rainfall
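
The abstract describes recasting Naive Bayes training in MapReduce form and notes underflow in the probability calculations as a weakness of plain NB. The sketch below illustrates that general idea only; it is not the thesis's MRNB implementation. It runs in memory in Python rather than on Hadoop, and the function names, the toy records, the two-bin predictors, and the use of Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(records):
    """Map step: count labels and (feature index, value, label) triples in one split."""
    counts = Counter()
    for features, label in records:
        counts[("class", label)] += 1
        for i, value in enumerate(features):
            counts[("feat", i, value, label)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: merge the partial counts produced by two splits."""
    return a + b


def predict(counts, features, n_values):
    """Pick the class with the highest log-posterior.

    Summing logarithms instead of multiplying probabilities avoids
    floating-point underflow when there are many predictors.
    n_values[i] is the number of discrete bins of predictor i
    (needed for the assumed Laplace smoothing).
    """
    labels = [key[1] for key in counts if key[0] == "class"]
    total = sum(counts[("class", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        n_c = counts[("class", c)]
        score = math.log(n_c / total)
        for i, value in enumerate(features):
            smoothed = (counts[("feat", i, value, c)] + 1) / (n_c + n_values[i])
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical, already-discretized records: (humidity bin, pressure bin) -> label.
split_a = [(("high", "low"), "rain"), (("high", "low"), "rain"), (("low", "high"), "dry")]
split_b = [(("low", "high"), "dry"), (("high", "high"), "rain"), (("low", "low"), "dry")]

n_values = [2, 2]  # each toy predictor has two bins
counts = reduce(reduce_counts, map(map_counts, [split_a, split_b]))
print(predict(counts, ("high", "low"), n_values))
```

Because the model consists only of counts, each split can be processed independently and the partial results merged by addition, which is what makes this training step a natural fit for a map/reduce decomposition.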