节点文献

流特征下的在线知识发现研究

Online Knowledge Discovery with Streaming Features

【作者】 俞奎

【导师】 吴信东; 王浩;

【作者基本信息】 合肥工业大学 , 计算机应用技术, 2013, 博士

【摘要】 相对于静态特征空间下的在线知识发现算法,动态特征空间下的在线知识发现算法的研究并没有引起足够的关注。数据特征空间的动态性是指在学习算法开始之前,问题的特征空间不是或不能事先给定,而是随时间动态变化。因而,数据特征空间的动态性对传统的静态特征空间下的在线知识发现算法提出诸多新问题和新挑战。为探索动态特征空间下的在线知识发现问题,本文提出了流特征(Streaming Features)的概念来建模特征空间的高维度和动态性而无需在学习算法开始前给定问题的全部特征空间。流特征定义为样本(示例)空间不变的条件下,问题的特征空间中的特征随时间逐个流入,且每个新流入的特征被立即在线处理。以流特征概念为核心,本文开展了流特征下的在线知识发现算法研究,主要取得了如下创新性成果:(1)针对流特征下的在线特征选择问题,本文提出了两种新颖的在线特征选择算法OSFS(Online Streaming Feature Selection)和Fast-OSFS。实验表明,OSFS和Fast-OSFS不仅能很好的处理训练数据特征维数信息预先未知条件下的特征选择问题,而且比已有的在线特征选择算法用较小的特征子集获得更高的分类精度。(2)针对流特征下的局部因果关系发现问题,本文提出了流特征下的局部因果关系发现算法(CDFSF, Causal Discovery From Streaming Features)。为提高CDFSF的效率,在忠实性因果贝叶斯网络条件下,利用父结点(原因)和子结点(结果)之间的对称性,提出了S-CDFSF (Symmetrical CDFSF)算法。与已有的局部因果发现算法(算法开始前需要训练数据的整个特征空间)的实验比较结果表明,我们提出的算法可以有效处理流特征下的局部因果发现问题。(3)针对流特征下的新兴模式(Emerging Patterns,以下简称EP模式)的挖掘问题,我们的研究工作分为以下两个方面。首先,基于因果贝叶斯网络中的因果相关性,本文提出了两种EP模式分类器:CE-EP和MB-EP来解决静态高维特征空间下的EP模式挖掘问题,CE (direct Causes and direct Effects)表示直接因果,而MB (Markov Blanket)表示马尔科夫毯。进而,针对CE-EP和MB-EP分类器不能处理动态高维特征空间下的EP模式挖掘问题,本文进一步提出了流特征下的在线EP模式挖掘算法,EPSF(mining Emerging Patterns using Streaming Feature selection)算法。实验表明,CE-EP和MB-EP可以有效解决静态高维特征空间下的EP模式挖掘问题,而EPSF算法不仅能够处理静态高维数据下的EP模式挖掘问题,而且可以有效的处理动态高维特征空间下的EP模式挖掘问题。(4)以火星图片上的陨石坑自动检测为应用研究,本文提出了流特征下的在线陨石坑检测算法,实验验证了本文所提出的流特征下的在线学习算法(包括OSFS、Fast-OSFS、 CE-EP和EPSF算法)在实际应用问题处理的有效性。

【Abstract】 Compared to the traditional online knowledge discovery with a static feature space, online knowledge discovery with a dynamic feature space has not attracted much attention. A feature space is dynamic when not all features are available before learning begins or when the feature space changes dynamically over time. Therefore, a dynamic feature space might make the feature space of traning data become high dimensional and uncertain, which is challenging for traditional online knowledge discovery algorithms.In order to explore online knowledge discovery with a dynamic feature space, we define the concept of streaming features to model high yet dynamic feature dimensions without the necessity of a whole feature space before learning starts. With streaming features, the features flow in one by one and each feature is online processed upon its arrival while the number of instances is fixed. With streaming features, in this dissertation, we study online knowledge discovery with a dynamic feature space and our main contributions are as follows.(1) We propose a new online feature selection framework for applications with streaming features where the knowledge of the full feature space is unknown in advance. With this framework, we present a novel Online Streaming Feature Selection (OSFS) method to select strongly relevant and non-redundant features on the fly. An efficient Fast-OSFS algorithm is proposed to improve feature selection performance. Experimental results demonstrate that the algorithms achieve better compactness and higher prediction accuracy than existing streaming feature selection algorithms.(2) We study a new research problem of discovery of local causal relationships in the context of streaming features. With a causal Bayesian network to represent causal relationships, we propose a novel algorithm, called CDFSF (Causal Discovery From Streaming Features) to discover local causal relationships from streaming features. In order to improve the efficiency of CDFSF, using the symmetry properties between parents (causes) and children (effects) in a faithful Bayesian network, we present a variant of CDFSF, S-CDFSF (Symmetrical CDFSF). Experimental results validate our algorithms in comparison with the existing algorithms for causal relationship discovery.(3) Mining emerging patterns (EP for short) is a challenging issue in the context of streaming features. To address this challenging problem, we propose two EP miners for mining emerging patterns from a high yet static feature space, called CE-EP and MB-EP, where CE stands for direct Causes and direct Effects, and MB for Markov Blanket. To mine EPs from a high yet dynamic feature space, we present a novel streaming pattern mining technique, called EPSF (mining Emerging Patterns with Streaming Feature selection). Compared to CE-EP and MB-EP, EPSF can mine EPs from not only a high yet static feature space, but also a high yet dynamic feature space. Extensive experiments on a broad range of datasets show the effectiveness of the CE-EP MB-EP, and EPSF classifiers against other well-established methods, in terms of predictive accuracy, pattern numbers, running time, and sensitivity analysis.(4) Also, we apply our proposed methods, including OSFS, Fast-OSFS, CE-EP, and EPSF, to a case study on automatic impact crater detection in real planetary images. Extensive studies reveal the advantages of our methods over existing streaming feature selection algorithms, crater detection methods, and well-known feature selection algorithms. Meanwhile, this case study validates our proposed methods on real-world data.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络