增值业务的概率故障定位

Probabilistic Fault Localization for Value-Added Services

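【Author】 Zhang Cheng (张成)

【Advisor】 Liao Jianxin (廖建新)

【Author information】 Beijing University of Posts and Telecommunications, Computer Application Technology, 2010, PhD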

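【Abstract (translated from the Chinese 摘要)】 As communication technologies such as IN (Intelligent Network), 3G/IMS (Third Generation/IP Multimedia Subsystem), and NGN (Next Generation Network) have evolved, matured, and spread, the service provisioning capability of communication networks has steadily improved, and more and more value-added services have appeared. These services place higher demands on network management and on operations and maintenance. Faults such as service unavailability and degraded quality of service not only cost operators revenue but also erode user loyalty and can even drive customers away. Fault diagnosis is a key technology for ensuring the high availability, high reliability, and quality of service of value-added services. Fault localization largely determines the efficiency and effectiveness of fault diagnosis and is its core technology. In-depth research on fault diagnosis techniques suited to value-added services, and on fault localization in particular, therefore has real practical significance and research value.

Traditional fault diagnosis concentrates on detecting and localizing faults in resources such as devices and networks; it is concerned with the running state of devices and the connectivity of the network. The dependencies between these resources and services, the scope of impact, and the strength of association all fall outside its scope. Service faults differ considerably from traditional network faults:
(1) Service faults are hard to model. Compared with resource fault modeling in traditional fault management, the diversity, dynamics, abstractness, dependencies, and multi-domain nature of services make fault modeling more complex.
(2) The causes of service faults are complex. Besides the network, the platform, and software, the cause may be human, and there are also inter-domain faults.
(3) The scope of service faults is wider. Because telecommunication services must be highly reliable, highly available, and operable, service faults include not only functional and performance faults but also service support faults.
(4) Service faults are non-deterministic and hard to identify and localize. The running state of a service can often be determined only by weighing the surrounding context, especially when quality of service degrades gradually.
(5) Service faults affect users far more directly than resource faults do, and users are highly sensitive to them, which raises the requirements on the efficiency and effectiveness of service fault localization.

This dissertation takes the rapidly developing telecommunication value-added services of recent years as its object of study. Its concrete goals are to reduce the complexity of fault localization for value-added services, raise the detection rate, lower the false positive rate, shorten localization time, and improve the efficiency and effectiveness of fault localization. The research centers on the key technologies for localizing faults in value-added services at runtime. The main innovations are described in detail in the dissertation and summarized as follows:
(1) Traditional fault models seldom model the association between resources and services and do not consider their dependencies, scope of impact, or strength of association. Two fault modeling methods are therefore proposed: a shallow-knowledge method based on statistics and data mining (Statistics and Data Mining, SDM), and a method that borrows the idea of overlay networks and combines it with end-to-end service provisioning (Overlay Network and End-to-End Service Provisioning, ONEE). ONEE comprises horizontal fault modeling among service components and vertical fault modeling between service components and resource components. SDM and ONEE bridge the gap between value-added services and the fault diagnosis system, so that fault models for value-added services can be built accurately and conveniently.
(2) Optimal probabilistic fault localization has been proven to be NP-hard and is difficult to apply to large-scale, real-time value-added services. To meet the fault localization needs of value-added services, an efficient heuristic probabilistic fault localization algorithm, BSD (Bayesian Suspect Degree), is proposed; it uses a probability-weighted bipartite graph as the fault propagation model and draws on the greedy approach. Unlike existing heuristic fault localization algorithms built on minimum set cover, BSD uses valid incremental coverage, which reduces the likelihood of selecting the wrong faults. Analysis and simulation confirm that BSD is efficient and localizes faults well.
(3) Most existing fault localization algorithms observe alarms through a time window. In practice, however, the window size is hard to set accurately, and an ill-chosen observation window often degrades the performance of the localization algorithm markedly. To address this, an event-driven, non-deterministic incremental fault localization algorithm, IBSD (Incremental Bayesian Suspect Degree), is proposed. IBSD removes the drawbacks of localization based on an alarm observation window. Simulation experiments show that it outperforms the existing IHU (Incremental Hypothesis Update) algorithm.
(4) Although BSD and IBSD are robust to some degree, they include no measures against noise such as lost or spurious symptoms, so their performance drops considerably when noise is heavy. The MICAS (Minimum Interactive Checking with Adaptive Strategy) algorithm is therefore proposed for probabilistic fault localization in noisy environments. By introducing three mechanisms, namely an enhanced evaluation function, minimum interactive checking (probing), and an adaptive threshold-setting strategy, MICAS still localizes faults very well when symptom loss and spurious-symptom rates are high.
(5) Event-driven fault localization removes the influence of the alarm observation window on localization accuracy, but it is inefficient and struggles with large numbers of concurrent symptoms. Moreover, results produced before enough symptoms have accumulated have no practical value. The alarm observation window and the efficiency of the algorithm must be considered together. An incremental fault localization algorithm based on a sliding window with a preprocessing mechanism, SWPM (Sliding Window with Preprocessing Mechanism), is therefore proposed. Simulation results confirm the effectiveness of SWPM.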

【Abstract】 With advances in IN (Intelligent Network), 3G/IMS (Third Generation/IP Multimedia Subsystem), and NGN (Next Generation Network), the service provisioning capability of communication networks has improved greatly, and a growing number of value-added services have emerged. These services pose new challenges for network management and OAM (Operation, Administration and Maintenance). Unavailable services and poor QoS (Quality of Service) cause not only lost revenue but also declining customer loyalty and even the loss of customers. Fault diagnosis is a key technology for ensuring high availability, high reliability, and quality of service. Fault localization, as the central element of fault diagnosis, largely determines its efficiency and effectiveness. The study of fault diagnosis techniques for value-added services, especially fault localization techniques, is therefore important for both industrial application and academic research.

Traditional fault diagnosis focuses on detecting and localizing faults in devices and networks. It attends to the status of device operation and network connections and fails to consider the relationships between resources and services, such as causality, the way of impact, and the strength of dependency. Service faults differ from traditional faults in several ways:
(1) Modeling service faults is more difficult. Compared with resource fault modeling in traditional fault management, service fault modeling is more challenging because of the diversity, dynamics, abstractness, dependencies, and multi-domain nature of services.
(2) The root causes of service failures are more complicated. Besides the network, platform, software, and so on, human factors often give rise to faults.
(3) The scope of service faults is wider. Because services must be highly available and operable, service failures include not only function faults and performance faults but also support (assistant function) faults and inter-domain faults.
(4) The status of a service fault is non-deterministic and usually difficult to recognize. The operating status of a service often has to be judged from its context and environment, especially when service quality degrades gradually.
(5) Service faults affect users more than resource faults do. The sensitivity of users places greater demands on the efficiency and effectiveness of service fault localization.

This dissertation takes emerging value-added services as its research object. It aims to reduce the computational complexity of fault localization, improve the accuracy of fault detection, shorten fault localization time, and improve the efficiency and effectiveness of fault localization, and it focuses on the key technologies for fault localization of value-added services at runtime. The innovations of the research are described in detail and are summarized as follows:
(1) Traditional fault models often omit the relationships between resources and services and do not consider their dependencies, the way of impact, or the strength of dependency. We therefore propose two modeling approaches: fault modeling based on Statistics and Data Mining (SDM), and fault modeling inspired by overlay networks and end-to-end service provisioning (ONEE). ONEE consists of two sub-methods: horizontal fault modeling among service components and vertical fault modeling between service components and resource components. Together they bridge the gap between value-added services and the fault diagnosis system and generate models for value-added services accurately and quickly.
(2) Optimal probabilistic fault localization has been proven to be NP-hard and can hardly be applied to large-scale, real-time value-added services. Considering the requirements of probabilistic fault localization for value-added services, we present a heuristic fault localization algorithm called BSD (Bayesian Suspect Degree), based on a probabilistic bipartite graph and a greedy strategy. Unlike existing algorithms based on the minimum set cover problem, BSD uses valid incremental coverage, which reduces the likelihood of selecting the wrong faults. Analysis and simulations demonstrate the efficiency and effectiveness of BSD.
(3) Most existing algorithms depend on the symptoms observed within a time window. In practice, however, the proper window size is hard to determine, and an improper window can noticeably degrade the performance of a fault localization algorithm. To overcome this limitation of time windows in OAM practice, we develop an event-driven incremental probabilistic fault diagnosis algorithm called IBSD (Incremental Bayesian Suspect Degree). IBSD overcomes the drawback of inaccurate time windows in fault localization. Simulations show that IBSD outperforms the existing IHU (Incremental Hypothesis Update) algorithm.
(4) Although BSD and IBSD remain effective in the presence of slight noise, their performance degrades under heavy noise because they give no special consideration to robustness. Thus, based on BSD, we present an algorithm called MICAS (Minimum Interactive Checking with Adaptive Strategy). Through an enhanced evaluation function, minimum interactive checking, and adaptive threshold setting, MICAS achieves excellent fault localization performance even when a large number of symptoms are lost or spurious.
(5) Event-driven fault localization algorithms eliminate the effect of inaccurate symptom observation windows, but they are inefficient and struggle with large numbers of concurrent symptoms. Moreover, a judgment made before enough symptoms have accumulated is often wrong and therefore useless to network operators. Both the observation window and efficiency need to be considered. We therefore present a fault localization algorithm based on a sliding window with a preprocessing mechanism (SWPM). Simulation results demonstrate the validity of SWPM.
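The greedy idea behind item (2) can be illustrated with a small Python sketch. It is not the BSD algorithm: the abstract does not give the Bayesian Suspect Degree formula or the "valid incremental coverage" rule, so the scoring function below (prior probability times the summed edge probabilities to still-unexplained symptoms) is an assumption made purely for illustration, as are all names (`FaultGraph`, `gain`, `greedy_localize`) and the example numbers.

```python
from typing import Dict, List, Set

# Hypothetical fault propagation model (probability-weighted bipartite graph):
# fault -> {symptom: P(symptom observed | fault occurred)}.
FaultGraph = Dict[str, Dict[str, float]]


def gain(fault: str, graph: FaultGraph, prior: Dict[str, float],
         uncovered: Set[str]) -> float:
    """Assumed score: prior of the fault times the total edge probability
    to observed symptoms that no selected fault explains yet."""
    return prior[fault] * sum(
        p for symptom, p in graph[fault].items() if symptom in uncovered
    )


def greedy_localize(graph: FaultGraph, prior: Dict[str, float],
                    observed: Set[str]) -> List[str]:
    """Greedily add the highest-scoring fault until every explainable
    observed symptom is covered by the hypothesis."""
    uncovered = set(observed)
    candidates = set(graph)
    hypothesis: List[str] = []
    while uncovered and candidates:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates),
                   key=lambda f: gain(f, graph, prior, uncovered))
        if gain(best, graph, prior, uncovered) == 0.0:
            break  # no remaining fault explains any uncovered symptom
        hypothesis.append(best)
        candidates.remove(best)
        uncovered.difference_update(graph[best])
    return hypothesis


# Made-up example data, not taken from the dissertation.
graph = {
    "f1": {"s1": 0.9, "s2": 0.8},
    "f2": {"s2": 0.7, "s3": 0.6},
    "f3": {"s3": 0.9},
}
prior = {"f1": 0.1, "f2": 0.1, "f3": 0.1}
print(greedy_localize(graph, prior, {"s1", "s2", "s3"}))  # ['f1', 'f3']
```

Because each round scores a fault only by the symptoms it newly explains, f2 loses to f3 in the second round: its edge to s2 no longer counts once f1 has been selected. This is the general motivation for incremental coverage, though the dissertation's actual criterion may differ.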
