Research on Some Key Technologies of Big Data-as-a-Service (大数据服务若干关键技术研究)
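【Author】 Han Jing (韩晶)
【Supervisor】 Song Meina (宋美娜)
【Author information】 Beijing University of Posts and Telecommunications, Computer Science and Technology, 2013, PhD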
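【Abstract (translated from the Chinese)】 Big data is one of the important development directions of modern information technology. Sharing and analyzing big data will bring immeasurable economic value and will also be a strong driving force for society. In the big data era, representing big data in a unified way and enabling its processing, querying, analysis, and visualization are key problems that urgently need solving. Big Data-as-a-Service (BDaaS) is a new model for using data resources and a new model of service economy: by encapsulating the various kinds of big data operations, it delivers ubiquitous, standardized, on-demand retrieval, analysis, and visualization services to service consumers.
Research on BDaaS is still at the stage of conceptual discussion and therefore faces four challenges: 1) there is no standardized, user-experience-oriented BDaaS architecture that can hide the complexity of data resources and operations; 2) there is no generic unstructured data model that reflects user behavior characteristics, which makes BDaaS for unstructured data difficult to build; 3) existing data service models describe only service interface specifications, and a BDaaS service model that covers the characteristics of big data has not yet appeared; 4) there are no corresponding solutions for providing big data retrieval, analysis, and visualization services or for optimizing service capability.
Solving these problems requires systematic research on the theoretical model, service model, and implementation methods of BDaaS. This thesis therefore studies four key technologies: the BDaaS architecture, the BDaaS data model, the BDaaS service model, and BDaaS applications. To provide a standardized architectural scheme for building BDaaS platforms, the thesis first designs a User Experience-oriented Big Data-as-a-Service Architecture (UE-BDaaSA). Second, on the data model side, to enable BDaaS for unstructured data, it designs an unstructured data model based on subject behavior. On the service model side, it uses process algebra to build an algebraic model of big data services and their composition, and designs big data services based on an extended OWL-S semantic ontology. On the application side, it describes in detail the processing flows of the retrieval, analysis, and visualization services, and optimizes BDaaS capability through two measures: improving the accuracy of the retrieval service and improving service efficiency.
The main innovations of this thesis are:
(1) To address the problem that existing unstructured data models can hardly meet the needs of building BDaaS, the thesis proposes the Galaxy Data Model (GDM), an unstructured data model based on subject behavior. By monitoring the behavior of data producers and the context in which data is produced, it designs a generic unstructured data model that covers data features such as user behavior and semantic context, providing the data model foundation for unstructured BDaaS. Case validation shows that GDM is general and comprehensive, and that it has a lightweight implementation and a mature, easy-to-use operation language. In addition to traditional file systems, GDM supports modeling and retrieval of unstructured data in HDFS. GDM has also been used in practice in the National Pre-pregnancy Check Management Information System (NPCMIS), which verifies its feasibility and practicality. (Chapter 3)
(2) To address the lack of a service model that covers the characteristics of big data, the thesis proposes a big data service model based on an extended OWL-S ontology (Extended OWL-S based Big Data-as-a-Service, EO-BDaaS). By extending OWL-S with properties such as data source, data service type, and data service operation, it supports building and dynamically composing multiple types of big data services, including retrieval, analysis, and visualization. Case validation shows that, compared with existing data services, EO-BDaaS describes attributes and operations more completely, has stronger semantic understanding and automatic service composition capabilities, and seamlessly integrates the composition operations specific to data services into the implementation of big data services. (Chapter 4)
(3) To address the low accuracy of big data retrieval services, the thesis proposes HotRank, a heat-sensitive ranking optimization algorithm for unstructured data retrieval. It computes a heat score for each retrieval result from the degree of match between the attributes of the unstructured data and the task attributes of the service consumer, and sorts the results by heat score so that they better match user preferences. Simulation experiments show that the precision-recall of HotRank is better than that of the Windows Search ranking algorithm. HotRank can therefore improve the accuracy of BDaaS retrieval results, raising BDaaS capability by improving user experience. (Chapter 5)
(4) To meet the requirement for fast service response in BDaaS, the thesis proposes a Hybrid Prefetch Algorithm (HPA) based on data heat identification. It establishes data heat determination rules by analyzing users' data operation records, obtains prefetch candidates according to dynamic and static prefetch rules, and finally places the prefetched data in the cache. Simulation results show an average prefetch hit rate of 55% and an average accuracy of 43%, which the thesis takes as evidence that the algorithm predicts user data operations well and optimizes BDaaS capability in terms of service efficiency. A distributed persistent cache storage architecture based on HPA has also been applied in NPCMIS, which verifies its effectiveness. (Chapter 5)
The research in this thesis is part of the results of the National Key Technology R&D Program project of the 11th Five-Year Plan, "Research on Key Technologies of a Secure and Trusted Carrier-grade Operation Support System for Reproductive Health Services" (No. 2008BAH24B04), and of the Ministry of Education-China Mobile Research Fund project "Research on Key Technologies and Solutions for Internet-oriented Business Support Systems" (No. MCM20123031). It has been applied in the operational NPCMIS, helping that system evolve from data collection in the population and family planning domain to service-oriented sharing and visual analysis of cross-domain population and family planning big data. It also provides an effective solution and engineering practice guidance for building the "E-Government Cloud Computing Data Service Platform" of the National Engineering Laboratory for E-Government Cloud Computing.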
【Abstract】 Big data has become an important direction in the development of modern information technology. Sharing and analyzing big data would not only bring immeasurable economic value but also play a significant role in promoting the development of society. Big Data-as-a-Service (BDaaS) is a new pattern of data resource usage and a new form of service economy: by encapsulating heterogeneous data, it provides service consumers with ubiquitous, standardized, on-demand services, including search, analysis, and visualization.
Because research on BDaaS is still at the conceptual discussion stage, it faces four challenges: 1) there is no standardized, user-experience-based BDaaS architecture that can shield the complexity of data sources and operations; 2) the lack of a generic unstructured data model that reflects user behavior characteristics makes BDaaS for unstructured data difficult to build; 3) existing data service models follow the Web services model, and a holistic BDaaS service model with the characteristics of big data has not yet appeared; 4) there is no appropriate solution for providing data retrieval, analysis, and visualization services or for optimizing service capacity.
To solve these problems, this thesis studies four key technologies in depth: the BDaaS architecture, the data model, the BDaaS service model, and BDaaS applications. First, it designs a User Experience-oriented BDaaS Architecture to provide high-level, standardized guidance for building a platform. Second, for the data model, it designs a user behavior-based unstructured data model to describe unstructured data in a unified way. Third, for the service model, it establishes an algebraic model using process algebra and then designs an extended OWL-S ontology-based BDaaS model and a service composition approach. Finally, it describes the service processes of retrieval, analysis, and visualization in detail and optimizes BDaaS capacity through two measures: improving the accuracy of the retrieval service and improving service efficiency.
The main innovations of this thesis are as follows:
(1) Because existing unstructured data models can hardly meet the demands of BDaaS, the Galaxy Data Model (GDM), a user behavior-based unstructured data model, is proposed. By monitoring the behavior of the people who generate data, a generic model with comprehensive attributes, such as user behavior and semantic background, is created; it is the basis for realizing BDaaS for unstructured data. The case study shows that GDM not only has good versatility and comprehensiveness but also has a lightweight, easy-to-use description language and operating language. In addition to traditional file systems, GDM supports modeling and retrieval of unstructured data in HDFS. GDM has also been applied in the National Pre-pregnancy Check Management Information System (NPCMIS), which verifies its feasibility and practicality. (Chapter 3)
(2) Because a holistic BDaaS service model with the characteristics of big data has not yet appeared, the Extended OWL-S based Big Data-as-a-Service model (EO-BDaaS) is proposed. Properties for data sources, data types, and service operations are added to OWL-S in order to build many types of BDaaS, such as search, analysis, and visualization, and to compose services dynamically. The case study shows that, compared with existing data services, EO-BDaaS describes attributes and operations more comprehensively. It also has strong semantic comprehension and automatic service composition capabilities, and it seamlessly integrates the composition operations specific to data services into the implementation of BDaaS. (Chapter 4)
(3) To solve the problem of low accuracy in retrieval services, the thesis presents HotRank, a heat-sensitive ranking algorithm for unstructured data retrieval. First, a heat score is calculated for each search result as the degree of match between the attributes of the result and the task attributes of the service consumer; the results are then sorted by heat score, which brings them more in line with the user's preferences. The simulation results show that the precision-recall of HotRank is better than that of the Windows Search ranking algorithm. By improving retrieval accuracy, HotRank thus optimizes not only the user experience but also the service capacity. (Chapter 5)
(4) A Hybrid Prefetch Algorithm (HPA) based on data heat recognition is proposed to meet the BDaaS requirement for quick response. First, data heat determination rules are developed by analyzing the log of user data operations; then candidate data is obtained according to dynamic and static prefetch rules; finally, the prefetched data is placed in the cache. The simulation results show that the average hit rate of HPA is 55% and its average accuracy is 43%, which the thesis takes to indicate that the algorithm not only predicts user data operations well but also optimizes BDaaS capacity. In addition, an HPA-based persistent caching architecture has been applied in NPCMIS to verify its effectiveness. (Chapter 5)
The research in this thesis is part of the academic results of the National Key Project of the Scientific and Technical Supporting Programs "Research on a safe, reliable, carrier-class operation support system of reproductive health services" (No. 2008BAH24B04) and the Science Foundation of the Ministry of Education of China-China Mobile program "Research on key technologies and solutions of internet-oriented business support system" (No. MCM20123031). It has been applied in NPCMIS, helping that system evolve from data acquisition to BDaaS. It has also provided an effective solution and practical engineering guidance for the "National Cloud Computing E-Government BDaaS platform" of the National Engineering Lab of Cloud Computing E-Government.
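The hit rate and accuracy reported for HPA in item (4) can be made concrete with a small sketch. The abstract gives neither the heat determination rules nor the exact metric definitions, so everything below is an assumption for illustration: `hot_items` treats an item as hot when it appears at least `threshold` times in the operation log, hit rate is taken as the share of later requests served from prefetched data, and accuracy as the share of prefetched items that were later requested. The function names and the sample log are hypothetical.

```python
from collections import Counter


def hot_items(operation_log, threshold):
    """Return the items seen at least `threshold` times in the log.

    Assumed heat rule for illustration only; the thesis's actual rules
    are not stated in the abstract.
    """
    counts = Counter(operation_log)
    return {item for item, n in counts.items() if n >= threshold}


def prefetch_metrics(prefetched, requests):
    """Return (hit_rate, accuracy) under the assumed definitions.

    hit_rate: fraction of requests found in the prefetched set.
    accuracy: fraction of prefetched items that were requested at all.
    """
    if not prefetched or not requests:
        return 0.0, 0.0
    requested = set(requests)
    hits = sum(1 for r in requests if r in prefetched)
    used = sum(1 for p in prefetched if p in requested)
    return hits / len(requests), used / len(prefetched)


# Hypothetical operation log and later request sequence.
log = ["a", "b", "a", "c", "a", "b"]
cache = hot_items(log, threshold=2)  # "a" and "b" count as hot
hit_rate, accuracy = prefetch_metrics(cache, ["a", "d", "a", "c"])
print(hit_rate, accuracy)
```

Separating the two numbers shows why they can differ: hit rate is measured per request, accuracy per prefetched item, so a cache holding a few heavily reused items can score high on the first and lower on the second.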
【Key words】 Big Data-as-a-Service; unstructured data; data model; service model; search ranking algorithm
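The search ranking idea summarized for HotRank (score each result by how well its attributes match the consumer's task attributes, then sort by that score) can be sketched as follows. This is a minimal illustration, not the thesis's algorithm: the abstract does not give the scoring formula, so the heat score here is assumed to be the fraction of task attributes that a result matches exactly, and `Result`, `heat_score`, `rank_by_heat`, and the sample attributes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """A search result with descriptive attributes (hypothetical shape)."""
    name: str
    attributes: dict = field(default_factory=dict)


def heat_score(result_attrs, task_attrs):
    """Fraction of task attributes matched exactly by the result.

    Assumed scoring rule; the abstract only says the score reflects the
    degree of match between data attributes and task attributes.
    """
    if not task_attrs:
        return 0.0
    matched = sum(1 for k, v in task_attrs.items() if result_attrs.get(k) == v)
    return matched / len(task_attrs)


def rank_by_heat(results, task_attrs):
    """Sort results by descending heat score; ties keep their input order."""
    return sorted(
        results,
        key=lambda r: heat_score(r.attributes, task_attrs),
        reverse=True,
    )


# Hypothetical task attributes and candidate results.
task = {"project": "npcmis", "type": "report", "owner": "alice"}
results = [
    Result("a.doc", {"project": "other", "type": "report"}),
    Result("b.doc", {"project": "npcmis", "type": "report", "owner": "alice"}),
    Result("c.doc", {}),
]
ranked = [r.name for r in rank_by_heat(results, task)]
print(ranked)  # ['b.doc', 'a.doc', 'c.doc']
```

An exact-match fraction is the simplest choice that yields a score in [0, 1]; weighted attributes or partial matches would fit the same interface by changing only `heat_score`.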