
基于对象的并行文件系统接口语义扩展研究

Research on Interface Semantic Extension for Object-based Parallel File System

【Author】 Tu Xudong

【Supervisor】 Feng Dan

【Author Information】 Huazhong University of Science and Technology, Computer Architecture, 2011, PhD

【摘要 (Abstract)】 With the rapid development of computer technology, new storage-management and high-performance interconnect technologies have made it feasible to design and build efficient storage systems of thousands of nodes. Yet how to exploit greater software parallelism on the same hardware, so as to resolve the performance bottlenecks and scalability problems brought by growing storage demand, remains a hard problem. Large storage clusters are today served by scalable parallel file systems such as GPFS, PVFS, Ceph, Lustre and PanFS, and the storage solutions of the world's Top 500 supercomputers are essentially drawn from these systems. This dissertation therefore focuses on the architecture and implementation of an efficient parallel file system. To make the designed parallel file system better support high-performance computing, it starts from the interface layer between the parallel file system and applications and further studies: extending interface semantics to optimize system performance for the I/O demands of high-performance computing; layout-aware optimization of parallel jobs' I/O access patterns to better suit data-intensive scalable computing; support for new parallel computing frameworks in the parallel file system; and the parallelization of redundancy coding in the parallel file system.

An object-based parallel file system, CapFS, was designed and implemented, featuring customizable data layout, object-based remote direct data access, and transactional persistent storage management. A unified hierarchical data layout model and algorithm based on nested RAID is proposed, realizing client-driven customizable data layout while preserving full POSIX semantics. To address the flat namespace service and extensible attribute management problems in the object storage device (OSD) specification, an efficient object access and persistent storage management method is proposed, based on a kernel-level micro-database engine combined with the local file system; it supports persistent storage of variable-length objects and efficient queries on structured attributes. A direct object access method based on the OSD protocol and remote procedure call (RPC) gives storage clients a shared multi-network-device access mode, with an independent data-representation layer ensuring multi-protocol negotiation on object storage nodes. Parameter tuning and overall benchmarking of the CapFS prototype show good performance and scalability.

Analysis shows that traditional file-system interface semantics (e.g., POSIX) poorly support the demands of high-performance computing: the I/O access patterns of parallel applications typically consist of accesses to many small files and non-contiguous data blocks, and stress concurrent file access, non-adjacent access, high metadata rates and coordinated I/O. To make the storage system's I/O model better support such computing, the POSIX interface is extended: from the standpoint of interface extension and semantic preservation, four extension methods are proposed, namely shared-file-descriptor I/O concurrency optimization, interface support for non-contiguous I/O, lazy and bulk metadata operation optimization, and layout control that preserves POSIX semantics. Tests show that the extended interfaces outperform existing methods.

Analysis of the I/O patterns in the MapReduce parallel computing framework reveals that traditional frameworks suffer from excessive intermediate-data copying and communication costs. Comparing traditional distributed file systems with the CapFS system implemented in this work for supporting parallel computing frameworks, a framework model "MapReduce over CapFS" is proposed that extends CapFS's parameterized layout with I/O-aware interfaces. I/O benchmarks and real data-intensive applications verify that using the computing resources of storage nodes for data processing effectively reduces the intermediate-data volume and the data transferred between compute nodes and the storage system. Tests on compute-intensive, I/O-intensive and mixed applications further show that the model provides a higher system speed-up ratio, especially for applications with an I/O-intensive component.

Finally, a method of analyzing and designing erasure-coding algorithms with the parallel computing framework is proposed, yielding a MapReduce-based parallel redundancy-coding algorithm that raises the system's coding efficiency and safeguards reliability. On this basis, an asynchronous redundancy-coding framework was implemented in CapFS, supporting redundancy configuration at multiple granularities: per-file policies, multi-user multi-file group policies, and device-facing integration at object, object-group and partition granularity. Algorithm complexity is analyzed with a coding-overhead model, and the system's time and space complexity is simulated on a metadata trace provided by Yahoo; the results show how space utilization changes as redundancy computation moves through three progressively coarser granularities (files, user file groups, and partitioned object sets), demonstrating adaptive trade-offs between data reliability and maintenance cost. At typical HPC scales, since redundancy is computed asynchronously by back-end storage nodes and parity data does not consume application I/O bandwidth, the overhead of parallel erasure-code redundancy computation is low.
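The "MapReduce over CapFS" idea — computing data placement from the parameterized layout so that map tasks run where their input already resides — can be sketched as a toy model. The round-robin layout and the names `chunk_owner` and `schedule` are illustrative assumptions, not the actual CapFS interface:

```python
def chunk_owner(chunk_index, stripe_width):
    """With a round-robin stripe layout, chunk i lives on node i mod width,
    so placement is computable without querying a central directory."""
    return chunk_index % stripe_width

def schedule(num_chunks, stripe_width):
    """Group map tasks by the storage node already holding their input chunk."""
    plan = {}
    for i in range(num_chunks):
        plan.setdefault(chunk_owner(i, stripe_width), []).append(i)
    return plan

# Six chunks striped over three storage nodes: each node runs the map tasks
# whose input it already stores, so no input data crosses the network.
print(schedule(6, 3))  # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

The point of the sketch is that a deterministic, client-computable layout makes "shipping code to data" a pure calculation, whereas a system with opaque placement must ask a metadata service for every block location.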

【Abstract】 With the rapid development of information technology, recent advances in storage system technologies and high-performance interconnects have made it possible to build increasingly powerful storage systems serving thousands of nodes. However, exploiting more software parallelism on the same hardware is the key to the performance bottlenecks and scalability problems brought by growing storage demand. Most cluster storage today is managed by scalable parallel file systems such as GPFS, PVFS, Ceph, Lustre and PanFS, and these are the storage solutions adopted throughout the list of the World's Top 500 supercomputers. This dissertation therefore focuses on the architecture and implementation methodology of a highly efficient parallel file system. To support high-performance computing (HPC) well, it further studies interface semantic extensions that optimize performance to meet HPC I/O requirements, layout-aware approaches that optimize parallel jobs' I/O patterns for data-intensive scalable computing, case studies on coupling the parallel file system with computing frameworks, and redundancy coding in the parallel file system.

The design and implementation of an object-based parallel file system, named CapFS, brings several features: customizable data distribution strategies, remote direct data access over the object-based storage (OSD) protocol, and transactional persistent data management. In particular, the proposed nested-RAID scheme, a unified model and algorithm for data layout, enables client-driven layout computation and maintains a consistent notion of a file's layout that preserves POSIX semantics without restricting concurrent access to the file.
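Client-driven layout computation under a nested-RAID layout can be sketched as follows. This is a minimal model assuming two levels of RAID-0 striping (across groups, then across devices within a group); the parameter names and the real CapFS layout descriptor are not taken from the dissertation:

```python
def locate(offset, stripe_unit, inner_width, group_count):
    """Map a file byte offset to (group, device_in_group, object_offset).

    Outer level: round-robin striping across `group_count` RAID groups.
    Inner level: round-robin striping across `inner_width` devices per group.
    Illustrative only; not the CapFS on-disk format.
    """
    stripe_no = offset // stripe_unit      # global stripe-unit index
    within = offset % stripe_unit          # byte offset inside the unit
    group = stripe_no % group_count        # outer-level round robin
    inner_no = stripe_no // group_count    # stripe index within the group
    device = inner_no % inner_width        # inner-level round robin
    obj_off = (inner_no // inner_width) * stripe_unit + within
    return group, device, obj_off

# 64 KiB stripe units, 3 devices per group, 4 groups: any client can map an
# offset to a storage target with pure arithmetic, no server round trip.
print(locate(65536 * 5 + 100, 65536, 3, 4))
```

Because the mapping is deterministic, every client computes the same layout independently, which is what allows concurrent access to a shared file without a serialization point.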
To address the flat namespace service and scalable attribute management in the OSD profile, a kernel-level mini database manager combined with the local file system was proposed to provide highly efficient object access and persistence management, supporting differentiated service for variable-size objects and well-structured attributes. The mechanism of OSD over RPC offers clients direct object-based storage access to the available, shared OSDs, and supports protocol negotiation across multiple mismatched storage transfer semantics. The tunable parameters and whole-system test results verified its effectiveness and good scalability.

Ample evidence and analysis show that the traditional POSIX interface cannot adequately support HPC parallel applications, whose I/O access patterns often consist of accesses to large numbers of small, non-contiguous pieces of data. Such applications produce interleaved file access patterns with high inter-process spatial locality at the I/O nodes and high metadata throughput. Extensions are needed so that highly concurrent, high-performance computing applications running on top of the prototype parallel file system can perform well. Four kinds of interface extensions were therefore presented to match storage I/O semantics to the upper applications: shared file descriptors for concurrent I/O, non-contiguous-I/O-oriented optimization, lazy and bulk metadata operations, and layout control that preserves POSIX semantics. This subset of POSIX I/O interfaces was deployed on the clustered, high-speed-interconnected file system, and experimental results on a set of micro-benchmarks confirm that the extensions greatly improve scalability and performance over traditional methods. From a bottom-up perspective, consider the popular parallel computing framework as an example.
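The shapes of the four extension families can be illustrated with plain-Python analogues. All names here (`ExtFile`, `writex`, `set_layout`, `bulk_stat`) are hypothetical stand-ins, not the CapFS API; each method's comment states which extension it mimics:

```python
import os

class ExtFile:
    """Illustrative analogues of the four interface extensions (assumed names)."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.layout = None

    def fileno(self):
        # 1) Shared file descriptor: cooperating processes reuse one open
        #    handle instead of each resolving the path on the metadata server.
        return self.fd

    def writex(self, extents):
        # 2) Non-contiguous I/O: a vector of (offset, bytes) pairs in one
        #    request, instead of a seek+write loop per piece.
        return sum(os.pwrite(self.fd, data, off) for off, data in extents)

    def set_layout(self, stripe_unit, width):
        # 4) Layout control: the client chooses striping parameters while
        #    ordinary reads and writes keep POSIX semantics.
        self.layout = (stripe_unit, width)

def bulk_stat(paths):
    # 3) Lazy/bulk metadata: amortize one round trip over many stat calls.
    return {p: os.stat(p).st_size for p in paths}
```

A short usage example: writing two extents in one `writex` call leaves a hole between them, exactly as two separate `pwrite` calls would, so the batched interface changes cost, not semantics.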
It is easy to see that the heavy intermediate-data copying and communication costs are caused by the semantic mismatch between the existing I/O model and the parallel computing framework. Building on a comparison with traditional distributed file systems, the proposed layout, parameterized by I/O-aware information, helps implement a MapReduce computing framework over CapFS. I/O benchmarks and real application tests demonstrate that such parallel computation can execute on the parallel file system, whose optimized, locality-aware parallel I/O is more feasible and flexible for shipping code to data than the Hadoop distributed file system. Among three kinds of applications, computation-intensive, I/O-intensive and both, the proposed scheme yields a much higher speed-up ratio for I/O-intensive applications.

From the top-down perspective, an erasure code was implemented with the parallel computing framework to provide better reliability and availability. This solution enables asynchronous compression of initially triplicated data down to RAID-class redundancy overheads, with the algorithms implemented on the MapReduce framework. Based on the algorithm, CapFS implements a redundant data management framework supporting redundancy at different levels, including intra- and inter-file, multiple user groups, and device level. Contrary to most existing solutions, in which parity data is created on the client side and shipped in band between clients and servers, the proposed method is asynchronous and fully transparent to clients' runtime, and parity computation and loss recovery can likewise be treated as parallel processing procedures. Experiments on a metadata trace from Yahoo clusters demonstrate the efficiency of the proposed algorithm and framework, respectively.
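The MapReduce formulation of redundancy coding can be sketched as follows. Plain RAID-5-style XOR parity stands in for the dissertation's actual erasure code, and the function names are assumptions; the shape — map tags blocks by stripe, reduce folds each stripe into parity — is the point:

```python
from collections import defaultdict

def map_blocks(chunks):
    """Map phase: tag each data block with its stripe id (identity here; a
    real job would also read the block from its local object store)."""
    for stripe_id, block in chunks:
        yield stripe_id, block

def reduce_parity(pairs, block_size):
    """Reduce phase: XOR all blocks of each stripe into one parity block."""
    groups = defaultdict(list)
    for sid, block in pairs:
        groups[sid].append(block)
    parity = {}
    for sid, blocks in groups.items():
        acc = bytearray(block_size)
        for b in blocks:
            for i, byte in enumerate(b):
                acc[i] ^= byte
        parity[sid] = bytes(acc)
    return parity

# Two stripes of two 2-byte blocks each; parity is computed per stripe on
# the storage side, asynchronously, rather than by clients in the I/O path.
chunks = [(0, b"\x01\x02"), (0, b"\x03\x00"),
          (1, b"\xff\x0f"), (1, b"\x0f\xff")]
print(reduce_parity(map_blocks(chunks), 2))  # {0: b'\x02\x02', 1: b'\xf0\xf0'}
```

Because reduces are independent per stripe, the same skeleton parallelizes over files, user file groups, or partitioned object sets simply by choosing what the stripe key ranges over — which is exactly the granularity knob described above.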
