节点文献

面向多核多线程的BGP协议并行技术研究

Research on BGP Parallelism Technologies for Multicore and Multi-threading

【作者】 高蕾

【导师】 龚正虎; 刘亚萍;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2009, 博士

【摘要】 为了适应超大规模超高速信息网络的需求,下一代Internet应具备安全性、实时性、可管理性、可用性和可扩展性等特点,而核心路由器作为构建Internet网络的重要基础设施,其现有硬件体系结构及协议软件结构已无法满足上述要求。特别是,随着Internet规模的快速增长以及各种网络应用的大量涌现,路由系统中关键域间路由协议BGP(Border Gateway Protocol)逐渐面临着性能与多维可扩展性等方面的严峻挑战。多核处理器具有较强计算能力、支持多个层次的并行性、拥有更高存储与I/O带宽、易扩展以及功耗低等巨大优势,为解决BGP协议所面临的问题提供了广阔的研究空间,成为构建下一代Internet网络的基石。因此,研究BGP协议如何充分利用多核处理器在结构与性能方面的优势来满足未来Internet需求,为下一代互联网基础设施建设提供理论基础和技术支撑,具有非常重要的实践意义。本文面向多核处理器平台,以BGP协议性能优化为重点,展开BGP协议线程级并行技术方面的研究。首先,通过分析BGP协议行为与潜在的并行特征,提出了两种线程化BGP(T-BGP)协议并行软件结构:多实例T-BGP与多任务T-BGP,并给出了它们的体系结构设计;继而,重点研究了多任务T-BGP协议中的推测线程划分,以及多实例T-BGP协议中的路由表并行访问与线程间路由通告技术;最后,在多实例T-BGP协议基础上完成了T-BGP原型系统的设计与实现。本文取得的研究成果主要包括以下几个方面:1、提出了两种T-BGP协议并行软件结构:多实例T-BGP协议与多任务T-BGP协议。多实例T-BGP协议以邻居会话划分为基础,由一个管理线程和一组BGP实例线程组成,利用数据并行思想将邻居会话分布在不同BGP实例线程上并行处理。该并行结构具有性能优化效果好、多维可扩展性强、路由选择实时性高等优势,程序开发设计难度较低,且易实现线程间的负载均衡,然而,其运行效率的提升还取决于对路由表并行访问冲突与线程间路由通告阻塞等关键问题的有效解决。多任务T-BGP协议以缓解BGP协议运行瓶颈为目标,采用了任务并行方式,将协议执行核心的更新报文处理过程分解在解析线程、路由更新线程和封装线程上并行执行,而其余功能仍由管理线程处理。该协议旨在挖掘程序内部的任务并行性,其并行性能的发挥取决于线程划分与线程调度等关键因素。2.提出了一种基于局部推测的线程划分技术,挖掘多任务T-BGP协议中更细粒度的线程并行来有效提高协议并行度。通过分析多任务T-BGP协议各功能线程中潜在的可推测性,采用基于最小割的启发式算法对各功能线程进行独立的局部推测线程划分。同时,还提出相应的线程推测策略来实现多线程并行执行,采用推测提交策略和写后读相关检测机制确保了线程推测执行与程序运行结果的正确性。最后,完成了基于局部推测技术的多任务T-BGP协议的功能验证与性能模拟,实验结果显示,该方法具有线程划分效果好且计算复杂性低的特点,采用局部推测技术之后显著减少了各功能线程的执行时间,获得了较好的性能提升。3.提出了一种高效路由表并行访问技术,有效解决了多实例T-BGP协议沿用传统路由表结构带来的路由表并行访问冲突问题。该方法将传统路由表重新组织成为一种动态可重构路由表结构,提出了细粒度动态拆分重组算法,周期性地动态调整路由表结构,根据路由节点访问频度来划分子树进而重组为多个子树集合。动态可重构路由表通过两级访问、局部子树加锁等途径支持不同线程并行访问路由表,有效克服了串行访问路由表的性能瓶颈。同时拆分重组算法通过均衡各子树集合上的访问负载,使得路由表并行访问冲突降低了92.5%。另外,还修改了路由表操作以支持简捷的访问行为。实验结果显示,在三种线程数量配置下,单线程最大路由更新耗时较使用传统路由表时平均降低了约48.7%、50.5%与71.1%,有效缓解了路由更新阶段的性能瓶颈。4.提出一种适用于多实例T-BGP协议的无阻塞线程间路由通告技术,克服了采用传统基于锁机制共享队列的路由通告方式所带来的大量共享访问冲突问题。在该方法中,每个线程为所有邻居各维护一个邻居播报队列,将更新后路由仅写入本线程所有播报队列;在获取播报路由时,各线程将根据本地处理的邻居会话标识从所有线程匹配的邻居播报队列中读取路由信息,避免了各线程竞争访问邻居播报队列。同时每个邻居播报队列都设计为SCLF(Speedy Concurrency Lock-free FIFO)无锁队列,实现了线程间无阻塞快速路由信息传播;还设计了带cache乒乓缓解的路由播报过程以降低因cache数据颠簸所造成的cache失效开销。实验结果显示,该方法使得线程间路由通告时间平均降低了约79.4%,设计的SCLF无锁队列显著改善了队列操作时间,较Lamport队列的操作耗时减少了56.5%左右。5、在多实例T-BGP协议的基础上,完成了T-BGP原型系统的设计与实现,并重点阐述了主线程与从线程设计的实现细节。该系统应用了动态可重构路由表技术与无阻塞线程间路由通告技术,较多实例T-BGP协议能够提供更加高效的并行处理能力。继而,还对T-BGP原型系统的路由学习时间、eBGP邻居切换速度、CPU使用率及系统加速比等关键指标进行了测试与分析,实验结果显示,与传统BGP协议相比,主要性能指标均得到了明显改善。综上所述,本文的工作针对BGP协议性能优化提出了有效的解决方案,对于推进新型域间路由技术的理论研究和实用化具有一定的理论价值与应用价值。

【Abstract】 To satisfy the requirement for the ultra large-scale and high-speed information network, the next generation Internet should possess the characteristics of security, real time, manageability, availability and scalability. For the limitations of hardware and intricate software architecture, the core routers which are the most important infrastructure on Internet cannot satisfy the requirements above. Especially, with the scale growth of Internet and the emergence of a large number of applications, the critical BGP (Border Gateway Protocol) in the interdomain routing system, gradually faces the challenges of performance and multi-dimensional scalability. Multicore processors have the advantages of powerful computing ability, multi-level parallelism, higher bandwidth for memory and I/O, and they are easy to scale and consume less energy, thus providing a broad research platform as the basis of the next generation Internet. Therefore, it is of great significance for practice to research how BGP utilizes the advantages of architecture and performance on multicores to meet the demands of Internet in the future, so as to provide the theoretical basis and technical support for the construction of the next generation Internet infrastructure.Towards multicore platform, the performance optimization of BGP is focused and the research of thread-level parallelism for BGP is carried out in this dissertation. Firstly, by analyzing the parallelism underlying in the behavior of BGP, two parallel architectures of Threaded BGP (T-BGP) including multi-instance T-BGP and multi-task T-BGP are presented along with their architecture designs. This dissertation emphasizes to address the issues of the parallel access to routing table and route propagation among threads in the first architecture and then the speculative thread partition in the second one. At last, the T-BGP prototype system is designed and implemented. The contributions of this dissertation are as follows:1. Two thread-level parallel architectures including multi-instance T-BGP and multi-task T-BGP are proposed on multicores. Based on the neighbor session division, the multi-instance T-BGP is composed of a manager thread and a group of BGP instance threads. With the idea of data parallelism, this architecture distributes the neighbor sessions on different BGP instance threads to achieve the parallel processing. Multi-instance T-BGP has the advantages of good optimization effects for performance, strong multi-dimensional scalability, high real-time for route selection, etc., and it is easy to implement the load balance and has low design complexity. But the key issues of parallel access to routing table and route announcement among threads both of which decide the performance of the protocol are still needed to be resolved. On the other hand, aiming at releasing the performance bottleneck of BGP, the multi-task T-BGP employs the way of task parallelism, and divides the update packet processing which is the core procedure of BGP into three parallel function threads involving parsing thread, route updating thread and packing thread, leaving the remaining functions in the master thread. This architecture could explore the parallelism inherent in the protocol, but the critical issues of thread partition and thread scheduling are still to be addressed to further improve its performance.2. Coarse-grain task parallelism limits the performance improvement of the multi-task T-BGP, and exploiting its finer-grain parallelism becomes the major method to yield better parallel efficiency. Thus, with the analysis for potential speculation of the function threads in multi-task T-BGP, a thread partition technology based on local speculation is proposed. It presents the min-cut heuristic algorithm to divide the speculative threads locally in each function thread, and also put forwards the static speculative strategy to achieve the thread speculation, making threads work in parallel. Additionally, the ordinal commit mechanism and the dependence detection strategy ensure the validity of thread speculation and program results. At last, the multi-task T-BGP based on local speculation is preliminarily implemented, and the functional verification and performance evaluation are carried out. The experimental results show that our method has the characteristics of good partition effects and low computational complexity, and greatly reduces the runtime of all fuction threads by using local speculation, thereby yielding good performance improvement.3. A mass of contentions by racing to access the routing table are brought in the multi-instance T-BGP, if still using the traditional routing table. Thus, this dissertation proposes a highly-efficient parallel access approach, in which a novel dynamic reconfigurable routing table is presented to achieve the fast parallel table access. In this approach, a heuristic-based divide-and-recombine algorithm is devised to partition routing table into subtries according to the real-time access frequency of table nodes, and then reorganize the subtries into several sets. And the algorithm is periodically performed to dynamically regulate the table structure. The main advantages of the dynamic reconfigurable routing table are concentrated on three aspects as follows. Firstly, the parallelization of route update by multi-threading is achieved through two-phase trie access by locking subtrie, so as to efficiently break the restriction of sequential table access. Secondly, the proposed divide-and-recombine algorithm significantly reduces the parallel access contentions by balancing accesses to subtrie sets. Thirdly, the table operations are effectively optimized to maintain the consistency with the behaviors of traditional routing table, attaining the higher rate of route update. The experimental results show that, when compared to traditional routing table, the maximal update time per thread is decreased by nearly 48.7%, 50.5% and 71.1% with the dynamic reconfigurable routing table under three thread configurations respectively, efficiently releasing the performance bottleneck of route update.4. A non-blocking route announcement method is put forward for the multi-instance T-BGP, thereby eliminating a great number of contentions which are incurred by the common method with the lock-based shared queues. In our scheme, each thread maintains an advertising queue for each neighbor with announcing the locally updated routes, and then gets the routes from the advertising queues of all threads with matching its local peer, thus avoiding the conflicts of accessing the same advertising queue by different threads. Additionally, each advertising queue is designed as the SCLF to significantly enhance the rate of operating the shared queues, and achieves the non-blocking route propagation among threads. At last, the route announcing process with cache ping pong mitigation is also presented to reduce the miss overhead by cache thrashing. The experimental results show that the runtime of route announcement among threads are reduced by 79.4% on average and the SCLF decreases the operation time by nearly 56.5% compared to Lamport lock-free queue.5. Based on the multi-instance T-BGP architecture, the T-BGP prototype system is implemented focusing on the details of the master thread and the slave one. With the key technologies of parallel access to routing table and non-blocking route announce- ment among threads, T-BGP could provide high-efficiency parallelism in comparison with multi-instance T-BGP. Finally, the experimental results for the key metrics of route learning time, the switch rate of eBGP neighbor, the CPU utilization and the system speedup show that T-BGP yields good performance improvement compared to BGP.To sum up, we present well-evaluated solutions in this dissertation for some key issues of BGP performance optimization. We believe that our contributions make a nice groundwork for future research and engineering on the interdomain routing protocol optimization both in theory and practice.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络