
面向云环境的重复数据删除关键技术研究

Research on Key Technologies of Data Deduplication for Cloud Environment

【Author】 付印金 (Fu Yinjin)

【Supervisor】 肖侬 (Xiao Nong)

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2013, PhD dissertation

【Abstract】 With the arrival of the Big Data era, the volume of data in the information world is growing explosively, and the storage and management demands of data centers have reached the petabyte or even exabyte scale. Studies have found that large amounts of duplicate data exist in today's increasingly complex massive datasets, in backup and archival storage as well as in regular primary storage. Traditional data backup techniques and virtual machine image storage management methods further accelerate the growth of duplicate data. As an emerging data reduction technology that restrains excessive data growth, improves IT resource utilization, and lowers system energy consumption and management cost, data deduplication has become a research hotspot in academia and industry.

As a key supporting technology for Big Data, cloud computing optimizes resource utilization through network computing and virtualization, providing users with inexpensive, efficient, and reliable computing and storage services. For cloud backup and virtual desktop cloud environments, which hold large amounts of redundant data, deduplication can greatly reduce storage space requirements and improve network bandwidth utilization, but it also raises system performance challenges. This thesis discusses how to use deduplication to optimize cloud backup services for personal computing environments, distributed cloud backup storage systems in data centers, and virtual desktop cloud cluster storage systems, so as to improve IT resource utilization and system scalability while reducing the impact of deduplication operations on I/O performance. Building on a comprehensive understanding of current cloud computing technology, the thesis analyzes and studies deduplication-based cloud backup, Big Data backup, and virtual desktop cloud applications, and proposes new system designs and algorithms. The main work and innovations are as follows:

(1) ALG-Dedupe, a tiered application-aware source deduplication mechanism for cloud backup services in personal computing environments. Through a statistical analysis of large volumes of personal application data, the thesis is the first to find that the amount of data shared across different types of application datasets is negligible. Guided by file semantics, application data are classified and an application-aware index structure is designed, so that deduplication proceeds independently and in parallel within each application class, with the chunking strategy and fingerprinting function selected adaptively according to the characteristics of each class (a minimal sketch of this per-class selection follows this abstract). Because client-side local redundancy detection and remote redundancy detection at the cloud data center are complementary in response latency and system overhead, the application-aware source deduplication is split into two tiers, local deduplication at the client and global deduplication in the cloud, to further raise the data reduction ratio and shorten deduplication time. Experiments show that ALG-Dedupe greatly improves deduplication efficiency while effectively shrinking the backup window and cloud storage cost and reducing the energy consumption and system overhead of personal computing devices.

(2) E-Dedupe, a scalable cluster deduplication method that supports Big Data backup in cloud data centers. Its novelty lies in exploiting both data locality and data similarity to optimize cluster deduplication. E-Dedupe combines inter-node super-chunk-level data routing with intra-node chunk-level deduplication, improving the data reduction ratio while preserving locality of data access. By extending Broder's theory of min-wise independent permutations, it is the first to apply handprinting to strengthen super-chunk similarity detection. Weighting similarity by the storage space utilization of each node, a handprint-based stateful super-chunk routing algorithm assigns data from backup clients to the deduplication server nodes at super-chunk granularity. Representative chunk fingerprints from the super-chunk handprints are used to build a similarity index, combined with a container management mechanism and a chunk-fingerprint caching strategy to optimize fingerprint lookup performance. With source-side inline deduplication, backup clients avoid transferring the duplicate chunks of a super-chunk to the target routing node. Extensive experiments show that E-Dedupe achieves a high cluster-wide data reduction ratio while effectively reducing communication and memory overhead and keeping the load balanced across nodes.

(3) A virtual desktop cloud storage optimization technique based on cluster deduplication. To support scalable virtual desktop cloud services, a virtual desktop server cluster must manage large numbers of desktop virtual machines. By exploiting the semantic information of virtual machine image files, the thesis is the first to propose a semantics-aware virtual machine scheduling algorithm to support a deduplication-based virtual desktop cluster storage system. Combining server-side chunk caching with local hybrid storage caching, a deduplication-based virtual desktop storage I/O optimization strategy is designed. Experimental analysis shows that the technique effectively improves the space utilization of virtual desktop storage, reduces the number of storage I/O operations, and improves virtual desktop boot speed.

Through the above research on key deduplication technologies for cloud environments, we provide strong technical support for future cloud storage and cloud computing research.
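To make the two-tier, application-aware design of contribution (1) concrete, here is a minimal Python sketch. The extension table, the per-class strategy map, and the cloud_global_lookup helper are illustrative assumptions rather than structures taken from the dissertation, and fixed-size chunking stands in for the adaptive content-defined chunking described above.

```python
# Hypothetical sketch of ALG-Dedupe-style application-aware, two-tier source
# deduplication. APP_CLASS, STRATEGY and cloud_global_lookup are assumptions
# made for illustration, not the dissertation's actual structures.
import hashlib
from pathlib import Path

# Classify files into application classes by extension (file semantics).
APP_CLASS = {".docx": "office", ".pdf": "office",
             ".jpg": "media", ".mp4": "media", ".txt": "text"}

# Per-class strategy: chunking mode and fingerprint function. Compressed
# media rarely shares sub-file content, so whole-file hashing with a cheap
# digest suffices; editable documents get finer chunks and a stronger digest.
STRATEGY = {"media":   ("whole-file", hashlib.md5),
            "office":  ("chunked",    hashlib.sha1),
            "text":    ("chunked",    hashlib.sha1),
            "default": ("chunked",    hashlib.sha1)}

local_index: dict[str, set[str]] = {}  # client-side index, one per class

def cloud_global_lookup(app_class: str, fingerprint: str) -> bool:
    """Stand-in for the remote global-index query at the cloud data center."""
    return False  # always miss in this self-contained sketch

def fixed_chunks(data: bytes, size: int = 4096) -> list[bytes]:
    """Fixed-size chunking stands in for content-defined chunking here."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def backup_file(path: Path, data: bytes) -> list[bytes]:
    """Return only the chunks that must actually be uploaded: duplicates are
    filtered first by the per-application local index (fast, no network),
    then by the global cloud-side index."""
    app = APP_CLASS.get(path.suffix.lower(), "default")
    mode, digest = STRATEGY[app]
    units = [data] if mode == "whole-file" else fixed_chunks(data)
    seen = local_index.setdefault(app, set())
    to_upload = []
    for unit in units:
        fp = digest(unit).hexdigest()
        if fp in seen or cloud_global_lookup(app, fp):
            continue  # duplicate detected locally or in the cloud
        seen.add(fp)
        to_upload.append(unit)
    return to_upload
```

For example, backup_file(Path("photo.jpg"), data) hashes the whole file once with MD5, while a .docx file is chunked and fingerprinted with SHA-1; keeping one small index per application class is what allows the classes to be deduplicated independently and in parallel.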

【Abstract】 With the coming of the Big Data era, the data capacity of the information world is growing at an explosive rate, and the scale of the datasets needing storage and management in data centers can easily expand to petabytes or even exabytes. Previous work has shown that a large amount of data redundancy exists in the massive datasets of both backup/archive storage and primary storage. Traditional data backup techniques and virtual machine image management can magnify this duplication by storing redundant data over and over again. In order to restrain excessive data growth, improve IT resource utilization, reduce system power consumption, and save management cost, data deduplication, as a novel data reduction technology, has become a hot topic in academia and industry.

As a key technology supporting Big Data, cloud computing optimizes resource efficiency through network computing and virtualization to provide cost-effective, highly efficient, and reliable computing and storage services. In cloud backup and virtual desktop cloud environments, deduplication can significantly reduce storage space requirements and improve network bandwidth efficiency thanks to the high data redundancy in these services, but it also brings new challenges to system performance. This thesis discusses how to apply deduplication to optimize cloud backup services in personal computing environments, distributed cloud backup storage systems in data centers, and cluster storage systems in virtual desktop clouds, so that storage space efficiency and system scalability are significantly improved and the negative impact of the deduplication process on I/O performance becomes negligible. After a deep study of the development of current cloud computing technology, we investigate deduplication-based cloud backup, Big Data backup, and virtual desktop clouds, and propose creative system designs and novel algorithms. In summary, the main contributions and innovations of this thesis are as follows:

(1) We propose ALG-Dedupe, an application-aware local and global source deduplication scheme for cloud backup services in personal computing environments. The thesis first discovers, through a content-overlap analysis of massive personal datasets, that the amount of data shared among different types of applications is negligibly small. Based on semantics-driven application classification, an application-aware index structure is designed that improves deduplication efficiency by eliminating redundancy within each application independently and in parallel. It also reduces computational overhead through an intelligent data chunking scheme and an adaptive choice of hash functions based on application awareness. To balance network latency and system overhead on personal devices, ALG-Dedupe combines client-side local redundancy detection with cloud-side global redundancy detection to improve the data reduction ratio and reduce deduplication time. Experimental results show that ALG-Dedupe significantly improves deduplication efficiency, shortens the backup window, saves cloud cost, and reduces the power consumption and system overhead of personal computing devices.

(2) We design E-Dedupe, a scalable inline cluster deduplication method for Big Data backup. The novelty of our study lies in exploiting both locality and similarity in backup data streams to optimize cluster deduplication. E-Dedupe combines inter-node super-chunk-level data routing with intra-node chunk-level deduplication to improve the data reduction ratio while keeping data locality within each node. Inspired by a generalization of Broder's theorem, E-Dedupe is the first application of handprinting in the context of cluster deduplication to improve similarity detection. After discounting super-chunk resemblance by the storage usage of each node, the handprint-based stateful data routing algorithm assigns data from backup clients to the deduplication server nodes at super-chunk granularity (a sketch of this routing follows the abstract). E-Dedupe builds a similarity index over super-chunk handprints on top of the traditional container-based, locality-preserving chunk-fingerprint caching scheme to alleviate the chunk-index lookup disk bottleneck. By performing source deduplication, backup clients avoid transferring duplicate data chunks to the target deduplication server node over the network. Finally, extensive experiments show that E-Dedupe maintains a high cluster-wide data reduction ratio while reducing system communication overhead and memory cost, with a balanced workload on each node.

(3) We propose a cluster-deduplication-based virtual desktop storage optimization technique. To support virtual desktop cloud services, a virtual desktop server cluster must manage a large number of desktop virtual machines. This thesis is the first to provide a virtual machine scheduling algorithm that optimizes deduplication-based virtual desktop storage by leveraging semantic awareness of virtual machine images. It also combines a chunk cache on the server with a local hybrid storage cache to improve the I/O performance of deduplication-based virtual desktop storage. Experiments show that our optimization improves the space efficiency of virtual desktop storage, reduces I/O operations, and enhances virtual desktop start-up performance.

By studying the above key deduplication techniques in cloud environments, we provide powerful technical support for the future of cloud storage and cloud computing.
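The handprint-based stateful routing of contribution (2) can be sketched as follows. This is a hedged approximation: HANDPRINT_SIZE, the SHA-1 chunk fingerprints, and the way similarity is discounted by storage utilization are assumptions made for illustration; the dissertation's exact parameters and weighting may differ.

```python
# Hypothetical sketch of E-Dedupe-style handprint-based stateful routing.
# The handprint size and the utilization discount below are illustrative
# assumptions, not the dissertation's exact formulation.
import hashlib

HANDPRINT_SIZE = 8  # k representative fingerprints per super-chunk (assumed)

def handprint(chunks: list[bytes], k: int = HANDPRINT_SIZE) -> set[str]:
    """Generalized min-hash: keep the k smallest chunk fingerprints of the
    super-chunk rather than the single minimum, which raises the chance that
    two similar super-chunks share at least one representative."""
    fps = sorted(hashlib.sha1(c).hexdigest() for c in chunks)
    return set(fps[:k])

class DedupeNode:
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.used = 0
        self.similarity_index: set[str] = set()  # representatives stored here

    def utilization(self) -> float:
        return self.used / self.capacity

def route_super_chunk(chunks: list[bytes],
                      nodes: list[DedupeNode]) -> DedupeNode:
    """Stateful routing: pick the node whose similarity index overlaps the
    super-chunk's handprint the most, discounted by how full the node is, so
    that deduplication ratio and load balance are served at the same time."""
    hp = handprint(chunks)
    def score(node: DedupeNode) -> float:
        overlap = len(hp & node.similarity_index)
        return (overlap + 1) * (1.0 - node.utilization())  # +1 breaks ties
    target = max(nodes, key=score)
    target.similarity_index |= hp                # remember the representatives
    target.used += sum(len(c) for c in chunks)   # coarse capacity accounting
    return target
```

In this sketch, a super-chunk that shares most of its chunks with previously routed data lands on the node already holding those representatives, so the intra-node chunk-level deduplication there can remove the duplicates, while the (1 - utilization) factor steers otherwise-unmatched super-chunks toward emptier nodes to keep the load balanced.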
