
Research on Key Technologies of Information Lifecycle Management in Content Aware Storage Systems

(Original title: 内容感知存储系统中信息生命周期管理关键技术研究)

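【Author】 Nie Xuejun (聂雪军)

【Advisor】 Zhou Jingli (周敬利)

【Author information】 Huazhong University of Science and Technology, Computer Architecture, 2010, PhD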

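【Abstract, translated from the Chinese】 As the demand for intelligent storage systems keeps growing, more and more application-layer functions are being integrated into storage systems, such as autonomous management, data security, and information retrieval. Traditional storage systems mainly process data at the block or object level and lack file-level information, so Information Lifecycle Management (ILM) functions cannot be integrated into them. Content Aware Storage systems that follow the XAM (eXtensible Access Method) specification use content metadata to carry the file-level information of data, and therefore provide a basis for integrating ILM into storage systems. This thesis studies the key technologies involved in integrating ILM into content aware storage systems and builds the ILM data-processing stages around content metadata: information integration, content classification, tiered storage, data backup, and information archival. The main work is as follows.

An information integration method based on content metadata is proposed and implemented. A content metadata specification is defined for the data-processing needs of ILM, covering the definition, extraction, representation, and transmission of content metadata. On the basis of content metadata, unstructured information is integrated in terms of both external form and internal semantics. A storage system prototype that supports the content metadata specification is designed and implemented. Performance tests show that information integration speeds up data preprocessing while having very little impact on the average I/O performance of the storage system.

An information classification algorithm oriented to content metadata is proposed and implemented. Because content metadata offers few classification features, but features of high semantic quality, a similarity computation model for content metadata based on feature-word sets is constructed. The model builds a similarity matrix from the feature-word sets in the training samples, smooths the matrix to compute the implicit correlations between feature words, and on that basis computes the feature vectors of content metadata. A data classifier is then built from the feature vectors using the K-Means algorithm. Performance tests show that the algorithm achieves higher precision and mutual information than traditional data classification algorithms and greatly reduces classification time.

A content-metadata-driven tiered storage model is proposed and implemented, comprising tiering based on application requirements and tiering based on cost requirements. The former meets the needs of information in applications such as backup, archival, security, and access control; the latter focuses on reducing the storage cost per unit of information while preserving the I/O performance of the storage system. An adaptive data migration algorithm based on rate control is proposed, which minimizes the impact of migration I/O on the normal I/O of the storage system. Performance tests show that the model effectively meets the storage requirements of information data without affecting the overall performance of the storage system.

A data de-duplication algorithm based on content characteristics is proposed and implemented. Current de-duplication algorithms for data backup do not consider that the content of different file types differs in its distribution of bit values. The proposed algorithm uses a candidate boundary histogram to represent the content characteristics of a file type and, on that basis, optimizes the key parameters of the traditional de-duplication algorithm. It accepts a lower data reduction ratio across files of different types in exchange for a higher data reduction ratio among files of the same type. A file system, TDFS, is designed to store variable-length data blocks efficiently. Performance tests show that the algorithm considerably improves the data reduction ratio on particular data sets.

An information archival model based on content metadata is proposed and implemented. Content metadata tags that support the OAIS (Open Archival Information System) archival specification are introduced to achieve the logical preservation of information. A disk-based software WORM (Write Once Read Many) model is proposed, which achieves the physical preservation of information by modifying the functional partitioning of the disk and the response behavior to iSCSI commands. Archived files are encrypted and the key is destroyed once the retention period has expired, which achieves the secure destruction of information; a key management mechanism based on time windows is also proposed to reduce the complexity of key management. Performance tests show that the archival model effectively meets both the functional and the performance requirements of archived information.

The experiments show that a content aware storage system effectively solves the lack of file-level semantics in traditional storage systems. Building the key data-processing stages of the ILM model around content metadata not only reduces the complexity of integrating ILM into the storage system but also greatly improves data access performance, meeting the demand for intelligent storage systems.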

【Abstract】 Intelligent storage systems need to integrate application-layer functions into the storage layer, such as self-management, data security, and information retrieval. However, Information Lifecycle Management (ILM) cannot be integrated into traditional storage systems because they lack the file-level information needed in the various stages of ILM. Content Aware Storage (CAS), which is based on the XAM specification, provides support for such integration. By wrapping file-level information into content metadata, CAS can supply complete information for the data processing in ILM, which provides the basis for integrating ILM into storage systems. This thesis proposes several key technologies involved in integrating the ILM stages into CAS, including information integration, content classification, tiered storage, data backup, and information archival. The main work is as follows.

An information integration model based on content metadata is proposed. A content metadata specification is defined from the requirements of ILM, covering the definition, extraction, representation, and transport of content metadata. Information is integrated in terms of both outer format and inner semantics. A CAS prototype that supports the content metadata specification is designed and developed. The experimental results show that information integration degrades I/O performance very little.

A content-metadata-oriented information classification algorithm is proposed. A model for computing the similarity between content metadata is designed, which overcomes the limitation of having too few characteristic words. The model constructs a similarity matrix for characteristic words from the explicit relations in the training sample file collection, then calculates the implicit relations with a matrix smoothing algorithm and obtains a set of linearly independent vectors, from which the characteristic vectors of content metadata are calculated. The data classifier is constructed from the characteristic vectors and the K-Means clustering algorithm. The experimental results show that this classification algorithm achieves higher accuracy and mutual information than traditional classification algorithms and significantly reduces computing time.

A content-metadata-driven tiered storage model is proposed, comprising tiered storage based on application requirements and tiered storage based on cost requirements. The former satisfies the application requirements of information, such as backup, archival, security, and access control; the latter reduces storage cost while guaranteeing overall I/O performance. An adaptive data migration algorithm based on migration speed control is proposed, which minimizes the negative impact of migration I/O on normal I/O. The experimental results show that tier computation and data migration do not degrade the performance of the storage system, while the storage cost of information is reduced.

A data de-duplication algorithm based on content characteristics is proposed. By introducing a candidate chunk boundary histogram, the algorithm takes into account the differences between file types and uses the histogram to optimize the key parameters of the traditional de-duplication algorithm. The key idea is to give up some redundancy detection between files of different types in exchange for detecting more redundancy among files of the same type. A file system, TDFS, is proposed to store the variable-length chunks. The experimental results show that the algorithm improves the data compression ratio by 9.0% on average on some special data sets.

An information archival model based on content metadata is proposed. By introducing content metadata tags that support the OAIS specification, the model achieves the logical preservation of information. By modifying disk functions and the responses to iSCSI commands, the model achieves a disk-based soft WORM and the physical preservation of information. A key-based secure destruction scheme is proposed, which encrypts archived information and deletes the key once the retention period has expired. A time-based management scheme for encryption keys is proposed, which significantly reduces the complexity of key management. The experimental results show that the archival model satisfies both functional and performance requirements.

As the experiments show, the content aware storage system effectively resolves the lack of file-level semantics in traditional storage systems. By constructing the key data-processing stages of ILM on content metadata, the complexity of integrating ILM into storage systems is greatly reduced and data I/O performance is improved, which satisfies the requirements of intelligent storage systems.
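The abstract does not say how the time-based key management works, so the following Python sketch is only one plausible reading, not the thesis's design. It assumes that archived files whose retention periods end in the same time window share one encryption key, and that the key is deleted once the whole window has passed, which makes every file encrypted under it unreadable. The class name `TimeWindowKeyStore`, its methods, and the 30-day window length are hypothetical. The sketch covers key bookkeeping only: the encryption itself is omitted, and removing a key from a Python dictionary is not a secure erase.

```python
import secrets

WINDOW_SECONDS = 30 * 24 * 3600  # hypothetical window length: 30 days


class TimeWindowKeyStore:
    """Keeps one key per retention-expiry window (illustrative bookkeeping only)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._keys = {}  # window index -> 32-byte key

    def window_of(self, timestamp):
        # Map a timestamp in seconds to the index of the window that contains it.
        return int(timestamp // self.window_seconds)

    def assign_key(self, expires_at):
        # Files expiring in the same window share a key, created on first use.
        window = self.window_of(expires_at)
        if window not in self._keys:
            self._keys[window] = secrets.token_bytes(32)
        return self._keys[window]

    def lookup_key(self, expires_at):
        # Return the key for reading a file, or None if it was already destroyed.
        return self._keys.get(self.window_of(expires_at))

    def destroy_expired(self, now):
        # Delete keys only for windows that have ended completely, so that
        # no file loses its key before its own expiry time.
        current = self.window_of(now)
        expired = sorted(w for w in self._keys if w < current)
        for window in expired:
            del self._keys[window]
        return expired
```

Under this assumption the number of keys grows with the number of windows rather than with the number of archived files, which is one way the key management complexity could be reduced. The trade-off is that a file may stay readable until the end of its window, somewhat past its exact expiry time.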
